THE READING PROBLEM IN ARITHMETIC


PAUL W. TERRY 
University of Washington 


DESCRIPTION OF THE INVESTIGATION 


It is rapidly becoming a matter of general recognition among 
school people that a definite and distinctive problem in reading is to be 
found in each of the subjects of the elementary school program of 
studies. The subject of arithmetic, which offers for reading such 
characteristic materials as verbal problems and numerals which are 
not set in problems, is not exceptional in this regard. In view of this 
fact, teachers of arithmetic are confronted with the necessity of 
developing a specialized technique for the reading of arithmetical 
materials. Modern scientific studies in the psychology of reading 
have given but scant attention to the reading of numerals. Such 
interest in numerals, as has been manifested, has proceeded from 
the point of view of pure science rather than from that of practical 
educational applications. Little is known, therefore, of the methods 
employed by children in the gradual acquirement of the power of 
reading numerals. A simple approach to this general problem may be
found in the study of the methods used by adults in this process. It is 
the purpose of this article to present a few of the more important results 
of such an investigation.¹

The subjects who served in the several studies of the investigation 
were all graduate students of the School of Education of the University 
of Chicago. In each study from four to ten subjects were used. With 
only two exceptions their experiences with numerals had been entirely 
normal for adults of similar training. The instructions which were
given the subjects were designed to secure normal procedures on their
part. They were asked to entertain their usual problem-solving
attitudes and to proceed at normal speed. The problems were to be
read and solved. In every case, opportunity was given to the subject
before he began his work to examine materials similar to those to be
read or solved, and under similar conditions.

1 For a complete discussion of the investigation see Terry, P. W.: An Experimental
Study of the Reading of Isolated Numerals and of Numerals in Arithmetic
Problems, Supp. Ed. Mons. No. 18, Dept. of Ed., Univ. of Chicago.

The materials which they read were of two different kinds. First, 
there were ordinary arithmetical problems which were so designed as 
to include the desired numbers of the numerals that were to be studied. 
Some of the problems included two numerals; others as many as four. 
The lengths of the numerals varied from one to seven digits and 
numerals of similar lengths only were placed in any one problem. 
Problems A and B will serve as illustrations of the problem type of 
material. 

PROBLEM A 


At 47 cents a dozen what will 2 dozen eggs cost? 


PROBLEM B 


If one telephone company uses 1,918,564 cross-bars during the 
year, and another company, in the same period, uses 617,453 cross-bars, 
how many more does the one use than the other? 

In the second place, there were ordinary numerals which were 
placed in lines and isolated from each other and from any other context.
These numerals were from one to seven digits in length and included 
a representative number of each length. 

The methods which were followed in procuring the data were three 
in number. In the first place, the adult subjects of the investigation 
were asked to make introspective observations of their procedures with 
the numerals. This they were easily able to do after the brief periods 
of training which were given. In the second place, the introspective 
reports of the subjects were supplemented by direct observations of the 
readings which were made and recorded by the author of this report. 
The information gained by the use of these two methods served as a 
basis of interpretation of the data secured by the third and more 
objective method of photographing the movements of the eyes while 
reading. The eye-movement photographic apparatus was that which 
is described by C. T. Gray in his monograph, Types of Reading Ability 
as Exhibited through Tests and Laboratory Experiments.¹ For the
purposes of this article it will be sufficient to say that by the use
of this apparatus records were obtained which could be so interpreted
as to show what words or numerals (or digits of numerals) were the
objects of the reader’s attention at any fixation of the eye. At the
same time accurate records of the duration of fixations were secured.

1 Supp. Ed. Mons., Vol. I, No. 5: 83-90.


THE FIRST READING PHASE


The first important phenomenon, which came in evidence fre- 
quently during the course of the investigation, was the fact that two 
separate phases are distinguishable in the reading of arithmetical 
problems. These phases differ both in time and in purpose and may be
designated as the first reading and rereading phases respectively. 
When both phases were found in the reading of a problem, the general 
procedure on the part of the subject was first to read the text through 
with more or less attention to the numerals in order “to get the sense”
of the problem (first reading). When this was done his attention was 
directed for the second time to the numerals of the text with the 
intention of perceiving them accurately and completely (rereading). 

During the first reading of a problem the numerals are read with 
widely varying degrees of attention. In some cases, subjects attended 
with care to the identity and place of each component digit of the 
numeral, at the same time noted its character as whole or decimal and 
gained an accurate notion of its magnitude. Such detailed perception 
of a numeral is designated as whole first reading. In many cases, on the 
other hand, numerals were not read in this detail. At times the 
numeral was merely recognized as a numeral. In other cases the item 
of digit length, or digit length and identity of the first digit only, 
were observed. The first two digits only and digit length were also 
observed in some readings, and various other items (such as the fact 
that one numeral was larger than another) were reported by subjects. 
In all of these instances the perception of the numeral is indistinct 
and lacking in detail. Such reading of numerals is called partial first 
reading. 

Important conclusions as to the significance of partial reading can 
be drawn from a study of the frequency with which the subjects 
observed such details of the numerals as are listed above. In Table I 
is presented under five classifications the range of correct recall of the 
numerals which a group of subjects was able to report immediately 
after the first reading of a set of problems. No attempt has been made 
to separate the results of partial reading from those of whole reading. 
Conclusions will be drawn concerning only those ranges of recall which 
were reported in a decisively preponderant number of cases. Such
ranges of recall are obviously characteristic results of partial reading 
as well as of whole reading. The facts which stand out most strikingly 
from Table I are that almost without exception the numerals were at 
least partially read, and that in nearly every instance their digit length 
was noted. The identity of the first digit also was so frequently recalled 
as to be classified as an equally characteristic result of partial reading. 

Such details of the numerals apparently are of the same value to the 
subject as the general conditions of the problem. In consequence, 
these preliminary perceptions of the numerals are obtained con- 
temporaneously with recognition of the general conditions. The sig- 
nificance of partial first reading, therefore, appears to lie in the fact 
that it enables the subject to think about the problem without entering
at first into the minute details of solution. 

The number of pauses of the eye and the total time required for the 
reading of any longer numeral vary, according to the type of first 
reading which is used. When the numeral 1,918,564 was read in the 
partial fashion by Subject G only two pauses of 9 and 18 fiftieths of a 
second respectively were recorded. When the same numeral was 
read in detail by Subject B, however, five pauses of 8, 9, 20, 18 and 36 
fiftieths of a second respectively were required. In the same manner,
the number of digits which are read during one pause of the eye in
partial reading is decidedly greater than the number read during
a pause of whole reading. In the former case for the most part from 
three to four digits are read per pause, whereas in the latter, one and 
two digits are more frequently included in one pause. 
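
The arithmetic behind this comparison can be checked with a short sketch. The pause durations are those quoted above for the numeral 1,918,564; the function name `summarize` and the digits-per-pause average are illustrative assumptions, since the average treats the seven digits as spread evenly over the pauses, which the photographic records need not show.

```python
from fractions import Fraction

# One time unit of the photographic record = one fiftieth of a second.
UNIT = Fraction(1, 50)

def summarize(pauses, digits=7):
    """Return (number of pauses, total seconds, average digits per pause).

    `pauses` holds pause durations in fiftieths of a second.
    The digits-per-pause figure is a plain average (an assumption).
    """
    total_seconds = float(sum(pauses) * UNIT)
    return len(pauses), total_seconds, digits / len(pauses)

partial_g = [9, 18]             # Subject G, partial reading
whole_b = [8, 9, 20, 18, 36]    # Subject B, reading in detail

print(summarize(partial_g))  # (2, 0.54, 3.5)
print(summarize(whole_b))    # (5, 1.82, 1.4)
```

On these figures the detailed reading took somewhat more than three times as long as the partial reading, and the averages of 3.5 and 1.4 digits per pause fall within the ranges stated in the text.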

The percent of partial and of whole readings which a numeral 
received from a group of subjects was found to vary with its length, 
the number of other numerals in the problem, and with its position in 
the problem. Longer numerals varying in length from four to six 
digits were read partially in decidedly more than half of the cases 
whereas exactly the reverse is true for the shorter numerals of one 
and two digits. Significantly larger percents of numerals of a given 
length were read partially when as many as four numerals appeared in 
the context of a problem than when only two numerals of the same 
length appeared. The first longer numeral to appear in a problem in 
which several longer numerals were included, received a larger number 
of detailed readings than the other numerals.

Another factor which determines whether a numeral will be read 
partially or in detail is found in the attitudes of the individual subjects 
toward the numerals. The shorter numerals of one and two digits are
read in detail almost invariably by all subjects. To the longer 
numerals, however, some of the subjects give partial readings with 
remarkable consistency; whereas other subjects persistently read the 
same numerals in detail. The former may thus be classified as partial 
first readers of numerals and the latter as whole first readers. A clear 
illustration of the difference in the readings given to a set of numerals 
by partial first readers and by whole first readers is found in a com- 
parison of the records of subjects G and H, each of whom read a certain 
set of seven problems. There were nineteen numerals in the set. Six
of the numerals were one or two digits in length and thirteen were from
three to seven digits in length. Subject G, a marked partial first reader,
gave partial readings to fourteen of the numerals, while Subject H, 
a consistent whole first reader, read all of the numerals in detail. 

Attention has been called in paragraphs above to the fact that 
numerals in problems are attacked by partial first readers with atti- 
tudes which differ distinctly from the attitudes of whole readers. 
The very fact that attitudes towards the numerals do vary so widely 
suggests that the numerals stand out from the other contextual ele- 
ments of problems in some distinct way. An experiment was arranged, 
therefore, with a view to determining in what way the numerals and 
words, which constitute the text of an arithmetical problem, differ in 
their demands upon the attention of readers. The records from this 
experiment show that the numerals of problems make decidedly 
greater demands upon attention than the accompanying words. Less
than half as many digits as letters are perceived during one pause
of the eye. The average duration of pauses on numerals was found to
be approximately 40 per cent greater than for pauses on words. The
per cent of regressive pauses in the total number of pauses is much
greater for numerals than for words. In respect to each of these three
items the numerals make greater exactions of the readers, and this is 
true in spite of the fact that many of the numerals were read in the 
cursory partial fashion. The explanation of this contrast probably 
lies in the differences between numerals and words in point of con- 
struction. The letters in words appear and reappear in the same 
regular combinations which become familiar in the earlier years of 
schooling and are read as wholes. The digits in numerals, however, 
appear in constantly changing combinations. Each individual digit 
is significant in itself and all of the digits must be perceived before 
the numeral is accurately read. It is obvious, therefore, that the 
mind is more occupied with processes of analysis and combination of
component digits when numerals are being read than with similar 
processes when words are read. 


COMPARISON OF PARTIAL AND WHOLE FIRST READING


The above discussion provides ample evidence for the conclusion 
that the numerals are a unique feature of the reading situation which 
is involved in the reading of arithmetical problems. Since two 
distinctly different methods have been developed for reading the 
numerals, the question necessarily arises: which is the more efficient? 
The answer depends in large measure upon the subsequent question, 
which of the two methods is more economical of the reader’s time? 
Attention should be called at this point to the fact that the type of 
activity which was displayed by the subject after the first reading in 
his efforts to solve a problem did not appear to depend upon which 
method of attending to the numerals was followed during the first 
reading. Under these conditions, the first reading may be treated as a 
relatively independent phase of the solving of a problem and likewise 
the question of the comparative efficiency of the two methods of first 
reading may be considered independently of activities subsequent to 
the first reading. 


TABLE II
Comparison between partial and whole first readers of numerals in respect to
rates of reading

(Time unit = 1/50 second)

                                          |    Partial readers    |     Whole readers
Subjects                                  |   G   |   M   |   W   |   B   |    H    |  Hb
Total time required to read the numerals
  of the five problems¹                   | 195.0 | 304.0 | 245.0 | 480.0 |   359.0 | 337.0
Total time required to read the words
  of the five problems¹                   | 708.0 | 800.0 | 925.0 | 560.0 | 1,092.0 | 923.0
Average reading time per line with
  ordinary prose*                         | 52.5, 44.9, 72.2, 80.8, 75.0 [one value illegible]

1 A set of five ordinary arithmetical problems including 12 numerals and
111 words, which were read before the photographic apparatus.

* An ordinary expository prose passage of 10 lines from Judd’s Psychology of
High School Subjects, p. 190.

With this question in view data have been arranged in Table II
from the readings of five ordinary arithmetical problems and a selection 
of ordinary prose by six subjects. Three of the subjects used the par- 
tial method and three the whole method of reading the numerals. 
By inspection of the first row of the table it is readily seen that the
three partial readers (G, M and W) read the numerals more rapidly
than the three whole readers. The case for the words of the problems
is not so clear, although the second and third fastest subjects were the
partial readers G and M, and the slowest subject was the whole
reader H. With the selection of ordinary expository prose, the fastest
reading was clearly done by the subjects who used the partial method 
with the numerals in the problems. 
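
The comparison drawn from the first two rows of Table II can be sketched as group averages. The times and the counts of 12 numerals and 111 words come from the table and its footnote; the helper names `group_mean` and `per_item` are illustrative, and the assignment of subjects to the two groups follows the column headings as they are read here.

```python
NUMERALS, WORDS = 12, 111  # item counts from the footnote to Table II

# Total reading times in fiftieths of a second, from Table II.
numeral_time = {"G": 195.0, "M": 304.0, "W": 245.0,
                "B": 480.0, "H": 359.0, "Hb": 337.0}
word_time = {"G": 708.0, "M": 800.0, "W": 925.0,
             "B": 560.0, "H": 1092.0, "Hb": 923.0}

partial = ("G", "M", "W")   # partial first readers
whole = ("B", "H", "Hb")    # whole first readers

def group_mean(times, subjects):
    """Mean total time for a group of subjects."""
    return sum(times[s] for s in subjects) / len(subjects)

def per_item(times, subjects, n_items):
    """Group mean time per numeral or per word, rounded to two places."""
    return round(group_mean(times, subjects) / n_items, 2)

print(group_mean(numeral_time, partial), group_mean(numeral_time, whole))
print(per_item(numeral_time, partial, NUMERALS), per_item(numeral_time, whole, NUMERALS))
print(per_item(word_time, partial, WORDS), per_item(word_time, whole, WORDS))
```

With only three subjects in each group these averages are no more than a summary of the table. They show a large difference for the numerals and a much smaller one for the words, which agrees with the statement that the case for the words is not so clear.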

Partial first reading of numerals is essentially a method by which a 
smaller number of pauses is used than in reading in detail. The 
greater speed of the partial readers with the ordinary prose selection 
also was due primarily to the use of a smaller number of pauses. Ap- 
parently, the virtue of partial reading of numerals and the greater 
rapidity of the partial readers with problems and with other materials 
as well is due to the fact that the same quantity of reading material 
is read with fewer pauses of the eye. In point of time it is the more 
economical method and therefore as far as the materials and subjects 
of this investigation afford a basis for conclusion, partial reading 
appears to be the more efficient of the two methods. 


REREADING 


The second phase of the reading of a problem, the rereading,
follows immediately upon the completion of the first reading. The
rereading is concerned almost invariably with numerals only, although 
instances were recorded when words also were reread. Numerals of 
from one to seven digits can be read by one act of rereading. Numer- 
als one to four digits in length were invariably reread at one reading. 
Six and seven digit numerals in several instances required two acts of 
rereading. In such cases the first several digits were read as a group 
and copied on paper, whereupon immediately the remaining digits were 
read and copied. Numerals of greater digit lengths required greater 
amounts of time for rereading than did shorter numerals. The total 
time required for the rereading of a problem, as would be expected, 
was decidedly less than that required for the first reading. 

Two types of rereading were found which are clearly distinguished 
by differences in both function and procedure. The first type, which 
may be described as simple rereading, has for its function the securing
of further information concerning a numeral before a decision has 
been reached as to how to proceed with solving the problem. Only 
one of the numerals of a problem is selected for this type of rereading, 
for the most part. The object of the reader is very specific. He looks 
for such detailed items as the number of digits, the identity of certain 
digits and the location of the numeral in the line of print. The second 
type of rereading has for its function, apparently, a careful inspection 
of the numerals as a step preliminary to copying them. After each 
case of rereading for copying, the records show that the subject im- 
mediately proceeded to copy the numeral on the paper upon which 
computation was to take place. It thus appears that whenever this
type of rereading was used the subject had already advanced with the
solving of the problem to the point of deciding that the numerals
should be copied.


REREADING PROCEDURES DURING COMPUTATION


It was found impossible to interpret precisely the photographic 
records of the eye-movements of subjects while they were engaged in 
computing from copied numerals. Several of the subjects, however, 
exhibited another method of computing, the records of which were 
more susceptible of accurate interpretation. In this case the procedure 
of the subjects was to treat directly with the numerals as they appeared 
in the context of the printed problem and to compute “mentally”
without copying the numerals. This method, which is called direct 
computation, was found practicable under widely varying conditions. 
It was used by both partial and whole first readers, with both longer 
and shorter numerals, and after partial as well as whole first reading of 
numerals. 

It is obvious that certain ways of reading and solving arithmetical 
problems which have been described in this investigation are more 
economical than others. When the method of direct computation is 
used, not only is the laborious step of copying the numerals saved but 
also the additional step of rereading, for only in rare instances were 
numerals reread which were not to be copied. It is important to recall 
at this point that direct computation without rereading was practiced 
even when the numerals had been read only partially during the first 
reading. The records furnish conclusive testimony to the striking fact 
that only such details of numerals as can be perceived with the rapid 
and cursory attention which is given during partial reading, are
required as a basis for actual computation. In cases of this kind the
numerals were not read in complete detail until they were attacked in 
the process of computation itself. The conclusion may be drawn, 
therefore, in so far as this investigation is concerned that the procedure 
of direct computation based upon a partial first reading of the numer- 
als is the most economical way of reading and solving arithmetical 
problems. 

During the process of “mental” computation direct from the
printed lines, the numerals which appeared in a problem were not
treated with equal attention. This fact was observed in the records
of only those problems which contained exactly two numerals. In
several such instances one numeral was taken as the “base of
operations.” When a numeral was so used its digits served as the starting
point of the computation and the digits of the second numeral were 
related to it. More pauses of the eye and pauses of greater average 
duration were located on the digits of the “base” numeral than upon 
the digits of the second numeral. 

The numeral of greater digit length was selected almost without
exception as the “base of operations.” Such a selection is in keeping
with the common practice of placing the numeral of greater magnitude
first in the order of computation and relating the smaller numeral to it.
The larger number of pauses of the eye on the “base” numeral is
probably due to the fact that this numeral presents one digit more for
reading than the second numeral, and to the further fact that
computation both begins and ends on the “base” numeral. The fact that
the eye lingered for longer pauses on the “base” numeral suggests
that a peculiar quality of work was done during the pauses on this 
numeral. It is the author’s opinion that when the eyes of the subject 
were engaged with the digits of the second numeral, the work done was 
in the main simply that of recognizing the digits which were being read. 
On the other hand, the work which was accomplished during the pauses 
on the “base” numeral included not only recognition of the digits 
which were being read but also processes of a more definitely arith- 
metical quality. For such work it appears reasonably certain that 
additional time would be required. 


ISOLATED NUMERALS 


The isolated numerals were arranged for reading in two different 
ways. In one section of the investigation they were placed in col- 
umns, one numeral only to a line. In a second section they were 
placed in regular printed lines and at such distance within the line
from each other that the reading of one numeral did not interfere 
with the reading of any other numeral. The space between the lines 
of numerals in both sections of the investigation was arranged with a 
view to preventing the subjects from being occupied with numerals 
in more than one line at a time. The instructions to subjects called 
for a low and easy articulation of the numerals as they were being read. 
The provision for articulation enabled the author to hear and record 
the grouping of the digits of the numerals and the numerical language 
which was used. In the second section of the investigation photo- 
graphic records were made of the movements of the eyes of the subjects 
while they read the numerals. The data which were procured by this 
method supplemented those which were obtained by recording the 
subject’s articulations. 

One of the most striking features of the reading of numerals is the 
fact that the subjects habitually divided the numerals into digit 
groups. When this is done certain successive digits are so closely 
associated with each other in the reading as to form units of reading, 
which units are at the same time distinguished from other similar 
units. The digits which constitute a group are bound together by 
being pronounced in quick succession as one series. The pronuncia- 
tions of the several digit groups are separated from each other by time 
intervals distinctly longer than the intervals which separate the 
pronunciations of the individual digits. Three different sizes of groups 
were clearly distinguished in the readings, namely, those that were 
made up of one, two and three digits respectively. The one digit 
groups appeared more frequently in the one, three and seven digit 
numerals. The two and three digit groups appeared in the readings of 
numerals of all the greater digit lengths. 

It became apparent early in the course of the investigation that the 
digits of numerals of the same lengths were being grouped in much 
the same manner by all of the subjects. The outstanding fact is that 
the digits of numerals of any particular length are divided into a 
certain number of groups made up of a certain number of digits, which 
groups stand in a certain order of succession. Such an arrangement 
of the digits of a numeral is designated as a main group pattern. The 
one and two digit numerals were each read as single groups of one and 
two digits respectively. The first variation from one main group 
pattern as representative of the reading of numerals of the same length, 
occurs in the three digit numerals which exhibit two patterns. The 
four digit numerals appear almost invariably in a pattern of two groups
of two digits each. The five digit numerals show in a preponderant 
number of cases a pattern of two groups of two and three digits 
respectively, and the dominant pattern of the six digit numerals is that 
of two groups of three digits each, while a three group pattern of one, 
three and three digits respectively was used for seven digit numerals. 
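
The dominant main group patterns just described can be written as a small lookup, offered only as a sketch. The name `MAIN_PATTERNS` and the function `group_digits` are illustrative. Three digit numerals are left out on purpose, because the text reports two patterns for them without stating which they were.

```python
# Dominant main group patterns reported in the text, keyed by digit length.
# Each tuple gives the sizes of the successive digit groups.
# Three digit numerals showed two patterns that the text does not spell out,
# so no entry is given for them here.
MAIN_PATTERNS = {
    1: (1,),
    2: (2,),
    4: (2, 2),
    5: (2, 3),
    6: (3, 3),
    7: (1, 3, 3),
}

def group_digits(numeral: str) -> list[str]:
    """Split a numeral into digit groups by its dominant pattern."""
    digits = numeral.replace(",", "")  # comma punctuation is ignored
    pattern = MAIN_PATTERNS.get(len(digits))
    if pattern is None:
        raise ValueError(f"no single dominant pattern for {len(digits)} digits")
    groups, start = [], 0
    for size in pattern:
        groups.append(digits[start:start + size])
        start += size
    return groups

print(group_digits("1,918,564"))  # ['1', '918', '564']
print(group_digits("617453"))     # ['617', '453']
print(group_digits("4712"))       # ['47', '12']
```

The lookup returns only the most frequent pattern for each length. The readings themselves varied, particularly for unpunctuated numerals, so this represents the typical case and not every subject's grouping.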

Approximately one-half of the five to seven digit isolated numerals 
in one section of the investigation was not written with the usual 
comma punctuation. Opportunity was afforded in this way of study- 
ing the effect of punctuation on the grouping of the digits of the 
longer numerals. Punctuation apparently operates in decisive fashion 
to increase the number of three digit groups used, and conversely to 
decrease the number of groups of two digits. Fewer main group pat- 
terns appear in the readings of the punctuated numerals. By thus 
regenerating the main group patterns, and in encouraging the use of 
the larger group of three digits, the employment of punctuation appears 
to give greater facility and speed to the reading of numerals. 

When a detailed examination is made of the eye movements with 
which the numerals were read, it appears that two distinct types of 
pauses are represented. Pauses of the first type, which may be called 
strictly reading pauses, were probably used in recognizing the identity 
of the digits of the numerals and the relations between the digits. 
Such pauses are invariably located on the numerals and their durations 
are approximately equal to, or greater than the average duration of the 
pauses of the subject whose records are under consideration. A 
preponderant number of the pauses of any subject are of this first 
type. Pauses of the second type, which are called guiding pauses, 
were probably used in locating the first digits or the last digits of the 
numerals. They are found on the initial or final digits and more 
frequently on numerals of greater lengths. Some of these pauses 
appear on the lines between two numerals. Their duration is very 
brief as compared with that of the other type of pauses. 

The number of pauses and the total time required to read
a numeral depend on the digit length of the numeral. The average
total reading time per numeral increases steadily from an average of
21.45 fiftieths of a second for the one-digit numerals to an average of
104.54 fiftieths of a second for numerals of seven digits. Likewise
the average number of pauses per numeral increases steadily.
Numerals of the same length exhibit a notable consistency in the number of
pauses with which they were read. The average duration of the
pauses, on the other hand, does not depend upon the length of the
numerals in the same consistent fashion. 

The subjects did not uniformly follow one single style of attack as 
they read those isolated numerals which were arranged several to a 
line. Two radically different modes of perceiving the numerals were 
displayed. By one method a relatively large number of pauses of 
relatively short average duration were used. A significant proportion 
of these pauses were of the guiding type. By the other method rela- 
tively few pauses of relatively long average duration were employed. 
The shortest total reading times for the whole set of isolated numerals 
were found in the records of the two subjects who used the many-short- 
pauses method of attack. 
AN EXPERIMENTAL AND STATISTICAL STUDY OF
READING AND READING TESTS 
(Continued) 


ARTHUR I. GATES 
Teachers College, Columbia University 


THE THORNDIKE AND THORNDIKE-McCALL SCALE FOR THE
UNDERSTANDING OF SENTENCES


Thorndike and McCall have devised 10 new forms, identical in 
principle with the test reported by Thorndike in 1915.¹ In the latest
editions the test consists of 11 paragraphs, increasing in difficulty, 
which are to be read for the purpose of answering three or four ques- 
tions which follow each paragraph. The number of questions cor- 
rectly answered in 30 minutes may be transmuted into a scale score, 
in terms of which norms for ages and grades are given. Adjustment 
to the work in the test is secured by a fore-exercise. This test is of the 
sort frequently called a “power” test: it aims to discover how difficult
material the subject can comprehend. 

The original Thorndike Alpha and forms 1 and 2 of the McCall
revision were given to all grades. Forms 1, 2, 3, 4 and 5 of the latter
were given on successive days for experimental purposes to grades
IV and VI.²


The Difficulty of Forms 1, 2, 3, 4 and 5, Thorndike-McCall and the 
Consistency of Performance in the Several Forms 


The results for Forms 1 to 5, which were given on successive days to the
pupils of grades IV and VI, appear in Table III.

TABLE III

Forms            |   1  |   2  |   3  |   4  |   5  | Average | S.D.
Grade 4   Mean   | 49.7 | 50.3 | 48.0 | 45.0 | 50.3 |  48.6   | 1.9
          S.D.   |  3.4 |  4.6 |  5.2 |  4.5 |  5.5 |   3.3   |
Grade 6   Mean   | 56.5 | 56.7 | 61.1 | 58.4 | 59.6 |  58.3   | 1.8
          S.D.   |  4.0 |  6.7 |  5.9 |  5.6 |  5.6 |   4.7   |

The several forms appear to be of approximately equal difficulty.
The difference between scores 56 and 61 is the difference between
answering 27 and 29 questions, which is not excessive. The order
from easy to difficult forms differs for the two grades, the correlation
between the two orders being in fact -0.10. This may well be
expected, for in this test the upper limits of comprehension are at
different levels. This fact makes it very difficult in tests of this type
to secure consistency in performances for the individual child even if
mean scores for a grade are equal, for the reason that the material in
the vital area, at the mean limit of comprehension (questions
22-24 for our grade IV), may vary in difficulty for individuals because
of the particular content. The correlations of single tests are likely,
then, to vary widely.

1 Teachers College Record, November, 1915 and January, 1916.
2 We are indebted to Miss Lulu Ailes and Miss Bess Young, teachers of grades IV
and VI respectively, for giving these tests.

Table IV gives the correlations between each single test and the
composite of all five tests, as well as a few correlations between the
different forms.
































TABLE IV
Form              2            3            4            5         Composite
Grade          IV    VI     IV    VI     IV    VI     IV    VI     IV    VI
Form 2         ..    ..    0.72  0.56    ..    ..     ..    ..    0.77  0.75
Form 3                     [row illegible]
Form 4 [?]     ..    ..     ..    ..     ..    ..    0.06  0.55   0.45  0.79
Form 5         ..    ..     ..    ..     ..    ..     ..    ..    0.73  0.83
[illegible]    ..    ..     ..    ..     ..    ..     ..    ..    0.65  0.78
































As expected, the inter-correlations of single tests vary widely but 
the correlations with the composite are, of course, much higher. All of 
these correlations are, however, very misleading unless the composition 
of our groups is considered; they represent a very narrow selection of 
the school universe in terms of performance in this test. The mean 
score for grade IV is 48.6 with an S.D. of 3.3: for grade VI, the mean is 
58.3, S.D. 4.7. The S.D. for grade VI represents but 1½ answers
from a mean of 28. The correlations for grade VI, it will be noticed,
are larger in the mean than those for grade IV for the reason, almost
certainly, that the S.D. is larger. With a random sampling of pupils,
the correlations would be even larger.
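The dependence of a correlation on the spread of the group can be illustrated with the standard correction for restriction of range on one variable. This formula is not used in the original; it is offered only as a sketch, and it assumes a linear relation with equal scatter throughout the range. The wider S.D. of 10 scale points in the example is a hypothetical value, not a figure from these data.

```python
import math


def correct_for_range_restriction(r, sd_restricted, sd_unrestricted):
    """Estimate the correlation in a wider group from that in a narrow one.

    r               -- correlation observed in the restricted group
    sd_restricted   -- S.D. of the selection variable in that group
    sd_unrestricted -- S.D. of the same variable in the wider group
    """
    k = sd_unrestricted / sd_restricted
    return r * k / math.sqrt(1 - r * r + r * r * k * k)


# Grade IV composite S.D. of 3.3 comes from the text; the observed r of
# 0.65 and the wider S.D. of 10.0 are hypothetical illustration values.
estimated = correct_for_range_restriction(0.65, 3.3, 10.0)
print(f"Estimated correlation in the wider group: {estimated:.2f}")
```

When the two S.D.'s are equal the function returns the observed coefficient unchanged, and the estimate rises as the wider group's S.D. grows, which is the direction of the effect described above.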








The consistency of group performance appears more clearly when
the scale score is converted into “Questions answered.” For example,
the S.D. of the mean scores on different forms from the composite is
1.9 for grade IV (see Table III, last column). This means a S.D. of
approximately three quarters of a question on a basis of 23, which is
the mean for the grade. For grade VI, the S.D. is about ¾ of a question
on the basis of 28. This indicates a rather high degree of
consistency in performance for the class as a whole in terms of the test
units. The units, however, are really very coarse, so that when we 
convert them into grade norms, the variability of the class means on 
the several forms becomes so great that the material cannot be used 
instructively for measuring monthly improvement, even for groups. 

In Table V the scores of Table III have been converted into grade
norms, using the tables provided by the authors of the test. The
S.D.’s are 0.32 and 0.75 of a school year for grades IV and VI
respectively.














TABLE V
Showing the grade norm equivalent of the mean scores yielded by each form
Forms          1     2     3     4     5    Composite   S.D.
                                             (mean)
Grade IV      6.2   6.4   6.0   5.5   6.4     6.2       0.32
Grade VI      7.5   7.7   9.2   8.0   8.5     8.0       0.75











An inspection of the grade status yielded by the individual tests 
reveals wide variations. The variability for individuals is, of course, 
very much larger. It is quite clear that if a precise measure of an 
individual is required several forms must be given or new forms, 
including more numerous and much smaller steps, must be devised. 

The correlations of the composite scores of 2 or more tests with the 
composite of all 5 follow: 





                2 with 5      3 with 5      4 with 5
Grade IV          0.70          0.85          0.96
Grade VI       [illegible]      0.88          0.97





Perhaps the following table, showing the deviations of composites of
1, 2 or more tests from the composite of all 5, is more illuminating:

TABLE VI
Deviations from the composite of 5 tests
                              Grade IV                Grade VI
                         In scale   In grade     In scale   In grade
                          score      years        score      years
Composite of 4 tests       0.35      0.04          0.1       0.033
Composite of 3 tests       0.70      0.112         0.2       0.066
Composite of 2 tests       1.40      0.224         1.7       0.560
1 test                     1.90      0.314         1.8       0.590








This table shows that if we give but one test, the class mean (grade
IV) will deviate 0.31 of a school year’s progress from the true score.
Calling 9 months a school year, if we give two tests and combine them,
the class score will deviate 0.22 of a year; for 3 tests about 1 month,
for 4 tests about 1/25 of a year. This means that if we wish to measure
monthly progress, we will need to secure a composite score of 3 or 4
tests, requiring 1½ to 2 hours’ time. To measure adequately the
monthly progress of each individual will require at least 3 tests.
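The conversion from fractions of a school year to months is simple arithmetic and can be laid out as follows. This is a sketch using the grade IV "grade years" column of Table VI and the 9-month school year stated in the text; the variable names are illustrative.

```python
MONTHS_PER_SCHOOL_YEAR = 9  # convention stated in the text

# Deviation of the class mean from the 5-test composite, in school
# years, keyed by the number of tests combined (grade IV, Table VI).
deviation_in_years = {1: 0.314, 2: 0.224, 3: 0.112, 4: 0.04}

deviation_in_months = {
    n_tests: round(years * MONTHS_PER_SCHOOL_YEAR, 2)
    for n_tests, years in deviation_in_years.items()
}

for n_tests, months in deviation_in_months.items():
    print(f"{n_tests} test(s): about {months} month(s)")
```

On this reckoning a single test leaves the class mean uncertain by nearly three months and a pair of tests by about two, so that only a composite of three or four tests brings the deviation down to a month or less.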

The data illustrate in an interesting way what seems to be a fact 
that even the more carefully constructed of our educational tests are 
insufficiently refined for exact individual examination, in short periods 
of time. They are extremely useful, however, since, inexact as they
are, they are much more accurate than any information otherwise 
obtainable and since exact measures can be secured if enough time is 
given to this investigation. 

The Effects of Practice—What we know of the effects of practice 
makes it certain that many functions with repeated testing will show 
large improvement. This improvement is specific in character, and 
we are not, in such cases, warranted in assuming that ‘‘ general reading 
ability,”’ for example, has shown a corresponding development. It is 
imperative that each test be carefully examined for purposes of
measuring the specific improvement.

A glance at Table III will show that the effects of two hours or more
of practice on Thorndike-McCall are very small. The improvement is
obscured by differences in difficulty of the tests, but the irregularity of 
scores shows that improvement cannot be great. 

Comparison of Thorndike Alpha 2 and the Thorndike-McCall.—The 
Thorndike-McCall is essentially the same as the Alpha 2, though
differing mechanically in the method of computing scale scores and
in the groups from which the norms were secured. The correlation
of Alpha 2 with any form of Thorndike-McCall is about the same as
the inter-correlations among forms of the latter. The grade norms
yielded by the two were found to vary but little in the lower grades
but very greatly in the upper grades. Table VII shows the data.
The grade norms were based on tests given Feb. 1. 

















TABLE VII
         Average inter-      Correlations of      Mean grade    Mean grade
Grade    correlations of     Thorndike-McCall     norm,         norm,
         Thorndike-McCall    with Alpha 2         Alpha 2       Thorndike-McCall
  3          0.55                0.72               3.25¹          3.2
  4          0.57                0.55               5.6            5.4
  5          0.58                0.40               6.15           6.6
  6          0.57                0.33               7.30           7.2
  7          0.60                0.52               8.50          11.2
  8          0.59                0.60               8.70          11.0
Mean         0.57                0.53
S.D.         0.02                0.11











Correlations of Thorndike-McCall with other Criteria.—A survey of 
Table VIII reveals certain facts about the Thorndike-McCall test.
It yields a correlation of 0.50 with almost any other single test of
comprehension except the Brown, and a somewhat smaller coefficient
with single measures of rate. It yields correlations of 0.50 with
vocabulary tests and 0.57 with an oral reading test. The correlation with
the corrected composite is 0.73 as compared to 0.52 with the composite
of speed. The latter correlation is significant, however, since this
test more than any other makes no pretense at measuring speed. The
mean correlation with the composite of group intelligence tests is 0.69,
showing that in all likelihood the ability to “understand sentences” is
considerably involved in most of our group tests.

1 The figure 3.25 means one-fourth of the way through grade III; 3.0 would
mean at the beginning, 3.5, midyear, etc.

The correlations with the Stanford Mental Age are interesting, in
particular the fact that the coefficients become rapidly higher as we
advance from grade to grade. There is a temptation to jump to the
conclusion that reading bears a high relation to general intelligence
only when it reaches the upper limits of mechanical perfection, usually
assumed for Grade V (or better Grade VI), but there are other possible
explanations. If this hypothesis were correct, it would mean that our
conventional group tests, depending considerably on reading, are
inadequate in the lower grades. On the other hand, this is not
equivalent to saying that reading for a third grade child is a very variable
performance (the inter-correlations among the reading tests for
Grade III are, in fact, as high as for any other grade) but merely that
the reading performance, though stable, is not highly correlated with
general intelligence, if the Stanford Mental Age is an adequate criterion.
This matter needs intensive investigation. It may be that we should
inform teachers that the M.A. is a real measure of intelligence, but
that it will not predict exactly a pupil’s achievement in the lower
grades. We expect to investigate the correlations with performance
in other school subjects in a later paper.

The study of certain cases of difficulty in reading has yielded 
information concerning the functions measured by the Thorndike- 
McCall tests which has been obscured in the correlations. 

Case S.S.¹ A girl in Grade VIII has an IQ known to be above 125.
She is an excessively slow reader. Her difficulty is functional, i.e.,
there was found no organic defect of any kind in any of her bodily or
nervous mechanisms. Her oral reading is poor for Grade III. Her
speed in silent reading, measured by the speed tests listed above, Gray’s
Silent Reading tests and informal tests, was found to be approximately
one word per second. The Thorndike-McCall yielded a score of 65,
compared to the mean of 63.4 for her grade. Score 65, in McCall’s
norms, is that of a pupil half through Grade XI of schools at large.
We are convinced that this is an adequate measure of her
“comprehension,” excluding speed. No other test of comprehension gave her a
high score. On the Courtis comprehension she scored 24, the next
lowest being 35, the mean 55.6 for the class. Monroe’s comprehension
score was 15 compared to a class mean of 38; the Burgess gave her a
score of 8, the class mean being 67.8, and so on. Several other cases
yielded similar results which are, to our mind, rather convincing
evidence of the diagnostic value of the Thorndike-McCall in critical
cases. It probably is the only test measuring a certain type of
“power” in comprehension, unaffected by the mechanical factors of
reading. From our data it appears that the test is but little subject
to improvement through specific practice, and there is consequently
doubt as to whether it ever, except at beginning stages of learning to
read, yields a measure of the amount of effectiveness of instruction.
It has frequently been noted that schools in which little or no effort
is made to teach reading make a good showing on this test, whereas
they may do badly in instructional material such as spelling and
arithmetic.

1 This case and others will be considered in detail in a later paper.

It has also been said that this test is simply another measure of 
intelligence. It probably is a measure of one sort of “verbal
intelligence” and is, on that account, one of the most useful of our tests.
There is a great need for further investigation for purposes of discover- 
ing for certain whether the function represented in this test is one 
that will yield to practice, or whether it is one which develops primarily 
as a result of inner growth. 


THE BURGESS PICTURE SUPPLEMENT TEST


The Burgess tests consist of a series of 20 pictures with a paragraph 
of about 55 words following each. The paragraph includes instruc- 
tions to complete the picture by drawings or by writing words. The 
paragraphs are said to be of equal difficulty of vocabulary, phraseology 
and thought. The score is the number of paragraphs the directions 
of which are properly fulfilled in 5 minutes. Several forms of the 
test are now available.¹

In most respects the experimental, statistical and reflective work 
behind the Burgess tests is admirable. Since it appeared to be a 
test of great promise, certain experiments were devised to analyse 
further some of its features. Form I was given to all grades in the 
usual way, and about 7 weeks later the pupils of grades III, IV and V 
were tested individually with the same form. It was clear that no 
pupil had sufficient recollection of the material to influence the score. 
Each child went through the first twelve paragraphs and, by the use
of two stop watches, the writer, who conducted the tests, was able to
measure (1) the time spent in actual reading and (2) the time spent 
in writing or drawing the answer for each paragraph. 

1 Burgess, May Ayres. The Measurement of Silent Reading. Russell Sage
Foundation, Educ. Monograph, 1921. Pp. 163.

The Reliability of the Burgess Scale. The correlations obtained
from the ratings of the two trials were:

Grade       Coefficient of    Coefficient of
            correlation       reliability
III             0.62              0.77
IV              0.59              0.74
V               0.66              0.79
Mean            0.62              0.766





The mean coefficient of reliability is slightly lower than that of 0.83 
obtained by Mrs. Burgess on grades II to VI at the Lincoln School, 
a difference to be accounted for, very likely, by our smaller S.D.’s.
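The text does not state how the coefficients of reliability were derived from the coefficients of correlation. The reported pairs are, however, consistent with the Spearman-Brown formula for a test of doubled length, and the sketch below checks this under that assumption. The function name and the `reported` dictionary are illustrative only.

```python
def spearman_brown(r, length_factor=2.0):
    """Predicted reliability when a test is lengthened by length_factor."""
    return length_factor * r / (1 + (length_factor - 1) * r)


# (coefficient of correlation, coefficient of reliability) as reported
# for the two trials in grades III, IV and V.
reported = {"III": (0.62, 0.77), "IV": (0.59, 0.74), "V": (0.66, 0.79)}

stepped_up = {grade: spearman_brown(r) for grade, (r, _) in reported.items()}

for grade, (r, reliability) in reported.items():
    print(f"Grade {grade}: r = {r:.2f}, stepped up = {stepped_up[grade]:.3f}, "
          f"reported = {reliability:.2f}")
```

Each stepped-up value falls within about 0.005 of the reported reliability, and the mean correlation of 0.62 steps up to about 0.765, close to the 0.766 given for the mean.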

Mrs. Burgess, in her monograph, quite rightfully questions certain 
interpretations of coefficients of reliability. She says they give us
“more information about the children than they do with regard to the
test” and their test scores “may vary widely from day to day and
still be actual true measures of ability on each occasion. Under such
conditions the fact that the scores vary from trial to trial does not
reflect any inaccuracy or inadequacy of the test.” It may not, in
some cases, but usually wide variability does indicate, in educational 
instruments, an inadequacy of the test. The scale, to meet present 
day demands, must either be improved or lengthened or both, so as 
to give consistent results. One factor which we can point out in 
passing is the coarseness of the units involved in this test. In grade 
IV, for example, nearly half a minute on the average is devoted to each 
paragraph. No credit can be earned unless the paragraph is
completed. In grade III, subjects spend more than a minute on
paragraph 8; a few spent nearly two minutes. Many children of really
unequal ability thus earn the same score. 

Another possible cause of apparent inconsistency in performance
is the rather rough method of scoring. No credit is given unless the
paragraph is correctly supplemented. We have found that experts in
grading are frequently uncertain whether the markings “exactly
follow directions.” That others have had the same experience is
stated by Whipple: “Those who are using the new Burgess Silent
Reading Scale are raising many questions concerning the scoring of
doubtful performances.”¹ The list of rulings given by Whipple should
have accompanied the original tests. When the penalty for an
erroneous solution is as heavy as it is in this test, the criteria for
grading should be very carefully standardized.

1 Bulletin of the Bureau of Tests and Measurements. University of Michigan,
No. 18, 1921.

Mrs. Burgess in her monograph has specially insisted upon the 
observance of one of the commonly accepted canons of scientific 
methodology, namely the control of all variables save one, the “Law of
the single variable.” She has endeavored more seriously than many
other workers in this field to put this law into effect. “While the
scale was being developed every endeavor was made to construct the 
paragraphs so that they should be of equal difficulty as reading material, 
of equal difficulty with respect to the instructions they contain, and of 
substantially equal requirements in the time necessary to read the 
paragraph and make the mark which supplements the picture.”
(p. 107). So far as empirical work is concerned this was done by 
giving the paragraphs in groups of 20, in one of two orders, to many 
children, counting the number of times that the instructions were 
successfully fulfilled. In the case of Form I, the paragraphs are of 
satisfactory equality with respect to this criterion. In spite of the 
unusual care taken to secure equality among the variables it appeared, 
when the test was first given at Scarborough, that the paragraphs were 
unequal in two respects which influenced the score, namely, the
time required to read and the time required to draw the supplement. 
It also appeared that in many cases nearly as much time was spent in 
drawing as in reading, and if so, unless the correlation between reading 
and drawing was approximately + 1.00, the test was subject to a very 
serious defect. The individual examinations were conducted to
determine the facts. 

Only those paragraphs which were correctly solved were considered. 
Table IX shows that in terms of the time required to complete the 
paragraphs, they are not of equal difficulty. Paragraph 8 requires 
more than twice as much time as 1 or 3 although, as Mrs. Burgess’
data show, it is correctly solved just as frequently, namely, in about
9 cases out of 10 attempts made by school children in general. The
S.D.’s from the mean total times are large for all grades. Opportunity
is afforded for wide variability of performance in individual cases and 
there is reason to fear that additional forms of the tests, if standardized 
by the single criterion employed might yield quite different scores. 
Additional forms were not in print at the time our experiment was 
conducted, but a recent study¹ shows that such is the case, Form 2
yielding a score about 8 per cent higher than Form 1. 





1 H. C. Daley: Journal of Educational Research, June, 1921. Pp. 71-72.
TABLE IX
Times in seconds. A = mean total time; B = mean time to read; C = mean time to draw.
                  Grade 3               Grade 4               Grade 5
Paragraph      A      B      C       A      B      C       A      B      C
    1        29.7   25.5    4.2    21.0   16.2    4.8    19.5   17.8    2.2
    2        39.0   23.4   15.6    22.5   15.3    7.2    27.1   14.3    2.8
    3        33.7   30.0    3.7    20.6   16.0    4.6    15.9   12.3    3.6
    4        40.2   29.7   10.5    26.8   17.6    9.2    24.1   16.6    7.5
    5        37.1   31.1    6.0    23.3   17.6    5.6    21.5   16.9    4.6
    6        46.0   42.0    4.0    27.1   23.8    3.3    23.6   19.9    3.7
    7        40.7   32.4    8.3    25.1   17.7    7.4    22.2   16.2    6.0
    8        61.4   34.7   26.7    42.2   18.7   23.5    44.5   20.5   24.0
    9        45.0   37.0    8.0    29.5   22.1    7.4    30.0   23.2    6.8
   10        38.5   32.8    5.7    22.1   15.9    6.2    22.4   17.3    5.1
   11        45.0   38.0    7.0    26.0   20.0    6.0    24.0   18.0    6.0
   12        51.2   45.7    5.5    27.9   23.1    4.8    28.0   22.0    6.0
Mean         42.3   33.6    8.8    26.2   18.6    7.5    25.2   17.8    6.5
S.D.          7.8    5.6    [?]     6.5    3.0    5.0     7.4    3.1    5.5
Percentage of total time
spent in drawing             [?]                  [?]                   26

Table IX shows that the inequality of time required for the several
paragraphs is due more largely to time spent in drawing than to time
spent in reading. The S.D.’s for reading are not unsatisfactory,
although there is, to be sure, room for improvement. A more serious
matter is the fact that one quarter of the total time is spent in drawing,
and that the time varies greatly from paragraph to paragraph. For
example, the range is from 3.3 to 23.5 seconds for grade IV, with a mean of
7.5 seconds and S.D. of 5.0 seconds. The time for drawing likewise varies
greatly from individual to individual. In drawing the three feathers to
complete Paragraph 4, for example, some rapidly draw three dashes or
ovals, while others draw the details of a feather with great care. Time
thus spent obviously has nothing to do with reading, and the
variability in drawing time will seriously reduce the validity of the test
unless there should be a very high correlation between speed of drawing
and reading ability. On a priori grounds we would not expect it. It
should be noted, in passing, that our children were specially warned to
waste no time in drawing and also that they have had unusual amounts
of practice in taking all kinds of group tests.
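The share of time given to drawing follows directly from the "Mean" row of Table IX. A minimal sketch of the arithmetic is given here; the names `mean_times` and `drawing_share` are illustrative, and the figures are the printed means in seconds.

```python
# (mean total time, mean time to read, mean time to draw), in seconds,
# from the "Mean" row of Table IX.
mean_times = {
    "III": (42.3, 33.6, 8.8),
    "IV": (26.2, 18.6, 7.5),
    "V": (25.2, 17.8, 6.5),
}


def drawing_share(total, draw):
    """Percentage of the total time that was spent in drawing."""
    return 100.0 * draw / total


shares = {
    grade: round(drawing_share(total, draw), 1)
    for grade, (total, _read, draw) in mean_times.items()
}

for grade, pct in shares.items():
    print(f"Grade {grade}: {pct} per cent of the time spent in drawing")
```

The grade V result rounds to the 26 per cent printed in the table, and the three grades together bear out the statement that roughly one quarter of the time goes to drawing.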

Column A in Table IX gives the score which the test actually 
utilized, i.e., mean total times to read and draw. Column B gives
the mean time spent in reading. Correlating columns A and B, it
is found that r = 0.52, 0.38 and 0.18 for grades III, IV and V respec- 
tively. Since the frequency of making a correct solution is constant 
for the several paragraphs, these coefficients show that the scores 
actually utilized by the test parallel rather poorly the scores which 
represent reading time precisely. This is merely another way of 
showing that the validity of the test will depend in part on the correla- 
tion between speed of drawing and speed of reading. 

Table X shows the correlations obtained from the speeds displayed
by the subjects in reading, drawing, and the total of the two.
Fortunately for the test, positive correlations of 0.16, 0.24 and 0.23 were
obtained for speed of reading with speed of drawing, which accounts
in some degree for the coefficients of 0.69, 0.93 and 0.85 between “total
time” and “reading time.” It is certain, however, that the large
amount and inequality of the drawing time make this test less useful
than it might otherwise have been. Columns 4, 5 and 6, Table X,
show that paragraphs 1 and 6, in which time for drawing was little,
yield a much higher correlation with the total score for reading than
does paragraph 8, on which the extreme amount of time is devoted to
drawing.

TABLE X
Showing correlations among several scores in the Burgess test
                                     2          3          4          5          6
                                  Speed of   Speed of   Speed,     Speed,     Speed,
                                  reading    drawing    para-      para-      para-
                                                        graph 1    graph 6    graph 8
1. Speed of reading   Grade III     0.69       0.33       0.65       0.42       0.21
   and drawing        Grade IV      0.93       0.57       0.71       0.67       0.47
                      Grade V       0.85       0.48       0.68       0.71       0.38
2. Speed of reading   Grade III                0.16       0.73       0.53       0.16
                      Grade IV                 0.24       0.66       0.63       0.21
                      Grade V                  0.23       0.63       0.52       0.29
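Why a low correlation between reading and drawing, joined to a large variability in drawing, weakens the total score as a measure of reading can be seen from the ordinary formula for the correlation of a sum with one of its parts. The sketch below is an illustration only and is not part of the original analysis: it treats the total as a simple sum of two components (which holds for times, though not strictly for speeds), and every numerical value in it is hypothetical.

```python
import math


def corr_total_with_part(sd_read, sd_draw, r_read_draw):
    """Correlation of (reading + drawing) with reading alone.

    Follows from Cov(R + D, R) = Var(R) + Cov(R, D).
    """
    var_total = (sd_read ** 2 + sd_draw ** 2
                 + 2 * r_read_draw * sd_read * sd_draw)
    return (sd_read + r_read_draw * sd_draw) / math.sqrt(var_total)


# Hypothetical values: reading S.D. fixed at 1.0, reading-drawing
# correlation fixed at 0.2, drawing S.D. allowed to grow.
for sd_draw in (0.25, 0.5, 1.0):
    r = corr_total_with_part(1.0, sd_draw, 0.2)
    print(f"S.D. of drawing = {sd_draw}: r(total, reading) = {r:.2f}")
```

As the variability of the drawing component rises relative to that of reading, the total parallels reading less and less closely, which is the pattern shown by paragraph 8 as against paragraphs 1 and 6.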


The results of the study of the Burgess test may be summarized 
as follows: 

1. The units are rather coarse. This is not a valid objection if 
sufficient time is allowed. It is a more serious matter, of course, 
in the lower grades, where from 5 to 7 paragraphs only are read by 
the median child in the time allowed—5 minutes. 

2. More specific directions for scoring should have been provided. 

3. Too much time is devoted to drawing—about one quarter. 

4. The various paragraphs, while equal in difficulty as regards 
success in supplementing the pictures, are not of equal difficulty 
on the basis of time required to complete the reading and drawing 
and are therefore not of equal value as measures of reading. 

The general idea upon which the Burgess test is based is excellent. 
If the time for drawing were reduced to a minimum, and other changes 
as suggested above made, it would, in our opinion, be vastly superior 
to any existing test of reading rate. In its present form it is very 
useful, as will be indicated by the correlations which follow: 

Correlations of the Burgess with Other Criteria.—Table XI gives the 
correlations of the Burgess test with the composites of rate and
comprehension and the several separate tests. On the whole, the
correlations with the composites are high, higher than those yielded by any
other test and equally high with rate and comprehension: 0.82 and 0.80
respectively. Performance on the Burgess is relatively consistent 
from grade to grade, the S.D.’s being small relative to those shown by 
other tests (see Table I). Allowing for differences in the S.D.’s for 
grade performance, the test seems to be about equally useful in all 
grades with a possible exception of grade VIII. 

The correlations with the Stanford-Binet are interesting. They 
increase from nearly zero for grade III to .56 for grade VI. The same 
relation was revealed by the Thorndike-McCall test. If these data 
represent the real relation of the sort of intelligence measured by the 
Stanford-Binet and reading ability, the grade differences will be of 
marked import. 

Correlations with the composite of group intelligence tests are
much higher and show no real variation as we pass from grade to grade.
Reading ability is demanded in most of the group tests, and the
identity of function probably accounts for the high correlation.
However, it is a notable fact that several hours’ work in group
intelligence tests yields a correlation with the reading ability composite
which is not as high as that of 5 minutes’ work on the Burgess. The
intelligence tests measure something more than, and something different
from, reading.

The Burgess test agrees most closely with Directions (r = 0.76 ±
S.D. 0.11) and with Monroe Rate (r = 0.72 ± S.D. 0.08), as might have
been expected. It yields a correlation of 0.5 or better with any test
save Brown’s comprehension, where the correlation is zero. The
correlation with the Thorndike-McCall is lowest (r = 0.48) with
the exception just noted. This does not indicate an inadequacy of the
Thorndike test, but rather indicates a fact, elsewhere verified, that the
Thorndike test measures a quite different, even if correlated,
function and that the two make an excellent team.

The correlations with the vocabulary tests are high; likewise
that with Gray’s oral. A wide reading vocabulary and mastery of the
mechanics of reading are typical of the reader who excels.

(To be concluded in November.) 











THE RESULTS OF RETESTS BY MEANS OF THE BINET 
SCALE 


J. E. WALLACE WALLIN 
Bureau of Special Education, Teachers College, Miami University 


This study is based upon two testings in the St. Louis School 
clinic of 136 cases, three testings of 16 of these cases and a fourth 
testing of one of the cases. The retests were made at varying in- 
tervals. The average interval between the first and second tests for 
the entire number was 2.2 years, with an extreme range of from one- 
half year to six years. The average interval between the second 
and third tests was 2 years, based on the 16 cases given a third ex- 
amination, while the range was from one year to almost four years. 

The reasons prompting the requests for the re-examinations of
these pupils were as follows: A few were re-referred to the clinic
because they had been excluded from school owing to low mentality,
and the guardians had applied for readmission. A few had been
demoted to the kindergarten because of low mentality. When the
eligibility requirement for admission to the special schools was lowered,¹
they were referred to the clinic with a view to assignment to a special
school. Many such cases, however, were reassigned without further
examination, because it was evident that a second examination was
not needed. A few came from the special schools. The parents
wanted them returned to the grades. But the large majority had
been assigned to ungraded classes, and were referred because they
failed to make adequate progress, either in the ungraded class in
which they had been placed or in the regular grades in which they
had been retained because an ungraded class was not available.² It
is evident that this represents a highly selected group of subnormals³
most of whom were reported for re-examination because they failed
to make adequate progress in the ungraded or regular classes.

1 The authorities at one time fixed an intelligence age of six years as the entrance
requirement for a feeble-minded child. This was subsequently lowered to five
years. The present requirement, fixed by state regulation, is a minimum
intelligence age of about 3 years, or an IQ of about 30.

2 The ungraded classes have been instituted for borderline, intellectually
backward, and pedagogically restorable pupils, as explained in Problems of
Subnormality, Chapter III, 1917.

3 The diagnoses made at the first examination were as follows: normal, 6;
retarded, 12; backward, 48; borderline, 24; potential feeble-minded, 11; morons,
4; potential moron, 1; imbeciles, 12; idiot, 1; and deferred diagnosis, 17. All the
categories except the last represent progressively graver degrees of intelligence
deficiency. The average IQ at the first examination for those examined by the 1911
Binet was 0.79 (119 cases) and for those examined by the Stanford 0.61 (15 cases).

In view of the circumstances under which the children were exam- 
ined, the expectation would be that the relative intelligence scores 
would show a progressive decline with each examination. What do 
the facts show? 

Referring first to the diagnosis made, based on all the facts gath- 
ered on each case, we find that the second diagnosis was the same 
as the first in the case of 42 subjects, 14 of these having been diag- 
nosed as backward, 10 as borderline and 10 as imbeciles. A higher 
classification was given to eight cases, and a lower classification 
to 70. The diagnosis was “deferred” the first time on 17 cases. 
We have not attempted to indicate here whether they were advanced 
or reduced in the later classifications. Of those given a lower
classification, 7 were reduced one-half step,¹ 38 one step, 8 a step and a
half, 4 two steps, eight two and a half steps, and one each three and 
a half and four and a half steps. Both of the latter who had tested 
normal or almost normal during the first examination, proved to be 
feeble-minded. The intervals between the examinations were 4.5 
and 3.6 years, respectively. Twenty-six were reduced from backward 
to borderline (21) or to potential feeble-minded, 11 from borderline 
(9) or potential feeble-minded to feeble-minded, 8 from backward to 
feeble-minded, and 7 from retarded to backward. In the third exam- 
ination three were given the same classification, and 11 were reduced, 
one one-half step, 6 one step, 3 two steps, and one two and a half 
steps, the diagnosis of the other two being deferred. 

Turning to the more objective criteria, we find that the successive
IQ’s were higher for 12 subjects with the 1908 scale, for 16 with
the 1911 scale, and for 7 with the Stanford scale.² The average
amount of improvement between the tests for these subjects was 6.6
IQ in the 1908 scale, varying from 2 to 13 IQ; 4.3 IQ in the 1911
scale, varying from 1 to 8; and 8.1 IQ in the Stanford scale, varying
from 2 to 17 IQ. It is evident that as determined by the measuring
scales, some of these pupils improved their position. A few
made unexpected advances, the number gaining seven or more IQ
points amounting to 6 in the 1908 scale, 3 in the 1911, and 4 in the
Stanford. The average improvement was greatest in the Stanford
scale.

1 “Potential feeble-minded” and “potential moron” are counted as half-steps.

2 It should be explained that many of the first and second examinations were
made before the Stanford scale was in use and that, because of limitations of time,
not all subjects who were tested after the Stanford scale was adopted were also
given the 1908 and 1911 tests.

All of the others made lower IQ’s in the successive tests, except 
4 whose IQ’s were the same in the 1908 scale, and 3 in the 1911. 
The average reduction in the second test amounted to 7.5 IQ in 
the 1908 scale (varying from 1 to 17); 7.8 IQ in the 1911 (varying 
from 1 to 23); and 5.1 IQ in the Stanford (varying from 1 to 10). 

While the vast majority of these subjects suffered relative deter- 
ioration in intelligence (as measured by the IQ), practically all made 
absolute gains. The average improvement in the 1911 scale from the 
first to the second examination, for those (52) who were given the 
test both times, was a year and a half, the average time interval being 
2.37 years, whence the annual improvement amounted to 0.42 of a 
Binet age. For those (8) who were given the 1911 a third time, 
the average improvement, during the interval of 2.25 years between 
the second and third examinations, was 1.1 year in terms of the 
scale, or 0.44 of a Binet age per calendar year. In the Stanford 
scale the improvement in intelligence from the first to the second 
examination, separated by a time interval of 1.87 years, was 0.67 year 
(based on 15 subjects who were retested by the Stanford), which 
is equivalent to a yearly improvement of 0.35 of a Stanford-Binet age, 
while the corresponding gain from the second to the third test, sepa- 
rated by an interval of 2 years, was 0.57 year (1 case). Not a single 
case showed a loss in intelligence age in any of the scales. In other 
words, not a single one of these children suffered actual dementia 
as determined by the scales, which may appear singular in view of 
the motley makeup of this group of cases, 2 of whom were epileptics, 
2 choreics, 20 unstable and neurotic, 4 psychopathic, 29 speech de- 
fectives, 15 unruly and 21 wordblind (19 of these being dyslexia cases). 
These cases represent pretty much the “ragtags” of our run of cases. 

So much for the extent of the gains and of the losses. We are 
also interested, however, in ascertaining how large the differences 
are between the IQ’s of the successive tests irrespective of sign, 
i.e., irrespective of whether the difference is a gain or a loss. Table 
I gives the average difference between the IQ’s received in the first 
and second and in the second and third tests, and the average of these 
differences. 
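
The distinction between an average gain or loss and an average difference irrespective of sign may be illustrated by a short sketch. The Python below is illustrative only: the function names are ours, and the two pairs of quotients are the 1908-scale IQ’s of Cases 12 and 18 as given in the case notes (74 to 87, and 99 to 72). It is not the computation from which Table I was prepared.

```python
def signed_mean_difference(pairs):
    """Average of (later IQ - earlier IQ); gains and losses offset each other."""
    diffs = [later - earlier for earlier, later in pairs]
    return sum(diffs) / len(diffs)


def mean_absolute_difference(pairs):
    """Average of |later IQ - earlier IQ|, i.e. irrespective of sign."""
    diffs = [abs(later - earlier) for earlier, later in pairs]
    return sum(diffs) / len(diffs)


# (first IQ, second IQ) on the 1908 scale for Cases 12 and 18.
pairs_1908 = [(74, 87), (99, 72)]

print(signed_mean_difference(pairs_1908))    # -7.0
print(mean_absolute_difference(pairs_1908))  # 20.0
```

A gain of 13 and a loss of 27 average to a loss of 7 when the sign is kept, but to a difference of 20 when it is disregarded, which is why the two kinds of average are reported separately.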

It will be observed that the average difference between the results 
in terms of IQ units, is not so very pronounced when the IQ’s are 

TABLE I 
Average IQ difference between successive tests 

                       Between 1st and    Between 2d and     Between all
                          2d tests           3d tests           tests
Scale used              No.¹    Ave.²      No.     Ave.      No.     Ave.
[1908]                   52      6.9        8      5.7       61³     6.6
[1911]                   52      6.3        8      6.1       61      6.2
[Stanford]               15      6.4        3      5.6       19      6.1
1911 and Stanford.....  104     10.7       15     11.4      120     10.2
1908 and Stanford.....  104     14.        15     14.9      120     14.1

1 Number of cases. 
2 Average difference between the IQ’s irrespective of sign. 
3 Includes the case given a fourth examination. 





TABLE II 
Successive IQ’s for subjects showing the largest differences 

             First examination            Second examination
Case    Chron.  1908  1911  Stanf.    Chron.  1908  1911  Stanf.
         age     IQ    IQ    IQ        age     IQ    IQ    IQ
  1      8.16    98    96    ..        9.16    92    90    ..
  2      7.41    97    97    ..        9.      93    93    ..
  3      7.25    97    88    ..        9.16    83    79    ..
  4      7.25   100    94    ..        8.58    93    86    ..
  5      6.      93    89    ..        8.08   107   107    ..
  6      8.8     84    77    ..       10.2     75    75    ..
  7      9.58    90    96    ..       11.4     90    85    ..
  8      [illegible]
  9      7.5     93    91    ..       [illegible]
 10      8.91    88    80    ..       11.58    71    69    66
 11      8.66    92    86    ..       11.75    80    76    66
 12      6.75    74    74    ..        9.75    87    82    66
 13      8.66    90    90    ..       11.25    ..    ..    65
 14      8.25    76    76    ..        9.66    ..    ..    50
 15      8.33    92    92    ..        9.91    ..    ..    76
 16      8.16   107    98    ..       [illegible]
 17      7.66    96    91    ..        8.66    88    84    ..
 18      7.91    99    91    ..       12.25    72    68    57
 19      8.5     ..    ..    33       10.1     ..    ..    45
 20      9.5¹    ..    ..    74       12.66    ..    ..    59

Third examination [case numbers illegible]
Chron.  1908  1911  Stanf.
 age     IQ    IQ    IQ
10.66    87    87    ..
11.75    87    84    ..
10.58    ..    ..    84
12.08    71    71    63
13.5     84    84    73
 9.08    ..    ..    73
10.16    81    77    72

1 A recently examined case not included among the 136. 

based on the use of the same scale. The difference in the successive 
testing amounts to less than seven IQ points. It is practically the 
same for the 1911 (Vineland) and Stanford revisions, and is not 
much higher for the 1908 revision. It is slightly larger for the first 
and the second tests, than for the second and third tests. 

The difference is considerably greater when the first measurements 
were made by the 1911 or 1908 scales and the later measurements 
by the Stanford, for the reason that the Stanford grades lower, as 
will be seen presently. 


Although the average differences between the successive IQ’s are 
not very pronounced, whichever scale is used, individual instances 
occur in which the intelligence scores earned differ rather widely. 
Table II contains a score of such cases. Space permits reference to 
only a few of these. 


Case 6.—American, an only child. First examination, June, 1915, age 8.8:— 
Conduct good, but nervous, giggles constantly, chews ties and school materials, 
no concentration or retention, very poor in spelling and arithmetic, best in reading 
and singing. First steps at two, talked at about three. Father’s family nervous 
and excitable, father and grandfather hard drinkers, father abusive to mother 
during pregnancy, and she “felt like committing suicide.” Mother’s people all 
“healthy.” Physical examination: 4 dental caries and notched teeth. Recom- 
mended to an ungraded class and reexamination after a year, diagnosed as neurotic, 
deferred. 

Second examination, November, 1916, age 10.2:—School report: cannot 
concentrate or retain, one-half year in kindergarten, 2 years in first grade, one 
half-year in second grade, repeating the work, greatest interest in animals, amiable 
disposition, takes correction kindly. Diagnosed as borderline, recommended to 
ungraded class, to which he had not been transferred, as no class was available. 

Third examination, September, 1918, age 12.08:—School record: in III-3, 
doing I-4 successfully, best in spelling, poorest in number and reading, “good 
in physical and mental characteristics,” “physically a fine upstanding boy,” 
“beloved, good natured.” Physical examination: looks normal, intelligent 
expression, four dental caries, conjunctivitis. Extremely deficient in reading, 
diagnosed as a moron, and assigned to a special school (had never been placed 
in an ungraded). 

Between the first and the third examination he lost 13 IQ by the 1908, 5 by 
the 1911, 21 IQ by the Stanford as compared with the initial 1908 IQ, and 14 IQ 
as compared with the initial 1911 IQ. The report from the special school in June, 
1920, indicates that his worst fault is lack of concentration, especially in academic 
work, he is very nervous and forgetful, not reliable, a tale-bearer, easily influenced, 
but generous and kind-hearted, he was doing second grade work in reading and 
spelling, and II-2 in arithmetic, greatest improvement in industrial work, in 
which, however, he shows little interest. Apparently the Stanford scale places 
this boy more accurately than the older scales. 

Case 8.—First examination, April, 1918, age 8.25. School record: sullen, 
lacking in concentration, easily discouraged, possibly doing kindergarten grade 
of work. Inmate of Masonic Home, family history unknown, except that father 
had pulmonary tuberculosis. Examination: stolid, vacant expression, right 
cervicals enlarged, fair physical condition, neurotic, lisps, lachrymose tendencies, 
diagnosis reserved, recommended to kindergarten. 

Second examination, February, 1919, age 9.08:—School record: 1 year in kinder- 
garten, one-half year in first grade, nervous, indolent, cannot concentrate, worst 
in reading, writing and all handwork, likes to work with dominoes. Examination: 
dull, infantile expression, one dental cavity, enlarged tonsils, stutters at times, 
lachrymose tendencies still evident. Diagnosed as potential moron, and assigned 
to a special school. Report from school after one month: conduct good, but poorly 
developed socially, little self-control, infantile tendency to weep. 

This boy made the significant gain of two years or 17 IQ by the Stanford in 
less than a year, and can be rated as not lower than a moron. To exclude such 
children from the benefits of the public schools on the basis of one Binet examina- 
tion would be hazardous. 

Case 12.—Russian. First examination, October, 1916, age 6.75. School 
record: in I-1, but doing little, poorest in everything requiring thought, greatest 
interest in games, amiable. Examination: lisper, stupid reactions, diagnosed 
as potential feeble-minded, recommended for special school or ungraded class, 
and reexamination after a year or two. 

Second examination, November, 1919, age 9.75. School report: in school 4 
years, repeated kindergarten five times, I-1 twice, I-4 once and II-1 three times. 
In II-3, but spending two periods daily in an ungraded class, doing first grade 
work in language and arithmetic, poorest work in language, best in reading and 
arithmetic, greatest interest in games, baseball, and swimming, has tried hard to 
overcome speech defect and has improved, good natured. Spoke single words at 
one and a half years, phrases at two, but did not talk well until seven, according to 
the mother who said he “got better after his diseases” (measles at four and scarlet 
fever at nine). Examination: post-nasal obstruction (tonsils removed at three), 
vision 1549 in each eye. Reads very well according to his intelligence level 
(Stanford 6.5), reading the Stanford ten-year selection in 30 seconds, with 3 misreadings 
and aid on “17 families,” and reproducing nine memories. Diagnosed as potential 
feeble-minded, and recommended to a special school. 

Here we find a curious disagreement between the old scales and the Stanford. 
When measured by the 1908 and 1911 scales there was an improvement of 13 
IQ and 8 IQ, respectively, between the two testings. This improvement was 
transformed into a loss of 8 IQ in the Stanford scale. When comparing the Stan- 
ford rating with the 1911 secured on the same day, the Stanford shows a loss of a 
year and a half, or 16 IQ. We are satisfied that the Stanford rating is too low. 

In the psychomotor test (Seguin) he graded about nine years, according to 
the writer’s norms.¹ 

Case 18.—Italian. First examination, June, 1915, age 7.91. School record: 
in ungraded class, best in reading, poorest in spelling and arithmetic, inattentive, 
unable to concentrate, varies from day to day, learns a little parrot fashion, 





1 Psychomotor Norms for Practical Diagnosis, 1916, Table XLIX. 

physically strong, fair conduct, pleasant. Father stabbed a man. Physical 
examination, one carious tooth. 

Second examination, November, 1919, age 12.25. School record: in un- 
graded, doing III—1 successfully, best in sewing, greatest interest in sewing, music, 
and industrial work, worst in arithmetic and language, erratic, quarrelsome, boist- 
erous, stubborn. Physical examination: enlarged lymph glands, two dental 
caries, a trace of strabismus. 

This girl tested normal by both of the old scales when first examined. Her 
subsequent history shows that this rating was entirely misleading. During a 
period of somewhat over four years her 1908 IQ fell 27 points, and her 1911 IQ 
23, while the difference between the first 1911 IQ and the later Stanford IQ 
amounted to 34 points. We are persuaded that the Stanford scale rates this girl 
too low. She is by no means an imbecile, and barely, if at all, a moron. She 
read the Stanford Selection in only 29 seconds, with only two errors, but reproduced 
only 744 memories. In the psychomotor test she did as well as a ten and a half 
year old child. The instances in which the Stanford scale rates too low are quite 
numerous in our experience. 

Case 19.—American. First examination, May, 1919, age 8.5. School record: 
in kindergarten, greatest interest to run and play, unstable, troublesome at home. 
Aunt’s brother went insane in army, subject to auditory hallucinations, strayed 
away. Spoke single words at four years; epileptic seizures since seven. Examina- 
tion: inattentive, distractable, neurotic, slavers, occasional tendency noticed in 
left eye toward internal strabismus, adenoids and tonsils already removed, con- 
genital lues suspected, intelligence age by Stanford 2.83, diagnosed as imbecile 
and excluded. 

Second examination, November, 1920, age 10.1. School record: in school 4 
weeks, in I-1, doing nothing, dribbles at times, health good, although struck by 
an auto a year ago injuring the head and spine. Physical examination: slight 
bilateral ptosis, slight stenosis in right nostril, restless neurotic, verbal persevera- 
tion, wandering attention, loquacious. Assigned to special school, where he was 
reported as doing kindergarten or subkindergarten work. 
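
The quotient recorded for this boy at the first examination can be checked from the figures in the case notes, on the usual definition of the IQ as intelligence age divided by chronological age and multiplied by 100. The Python sketch below is only a check of that arithmetic. The function name is ours, and rounding to the nearest whole number is an assumption, since the rounding practice followed in the clinic is not stated.

```python
def ratio_iq(intelligence_age, chronological_age):
    """IQ as 100 x intelligence age / chronological age (both in years)."""
    if chronological_age <= 0:
        raise ValueError("chronological age must be positive")
    return 100.0 * intelligence_age / chronological_age


# Case 19, first examination: Stanford intelligence age 2.83 at age 8.5.
print(round(ratio_iq(2.83, 8.5)))  # 33
```

The result agrees with the Stanford IQ of 33 entered for Case 19 at the first examination.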


The interest in this boy is in the marked improvement which he 
made in a year and a half, amounting to 1.6 years or 12 IQ by the 
Stanford. With such a degree of improvement in an apparently 
hopeless low grade case (and we have school records of others evincing 
a similar gain), it is not surprising that we have rejected the doctrine 
that all feeble-minded children should be denied the privileges of 
the public schools. The fact is that it is frequently impossible to 
determine for years whether a young mental subnormal is feeble- 
minded or not. We have examined a number of children who were 
unjustly excluded from school on the basis of a low test score and 
the assumption that the quotient would always remain the same. 
We counsel caution in the matter of the exclusion of assumed hope- 
less defectives from the public schools. The place in which to train 
the mass of mental defectives is in the public schools, not the state 
institutions, for economic, if for no other reasons. The public schools 
exist for the impartial service of society’s products. 

In the following case the discrepancy in the intelligence rating was 
just as great, but in the opposite direction. 


Case 20.—American. First examination, February, 1918, age 9.5. School 
record: in II-1, doing I-3 successfully, greatest interest in drawing, raffia and 
writing, poorest in number, reading and spelling, good in conduct and disposition, 
very slow in comprehension, frequently absent because of bronchial cough. Exam- 
ination: dental caries, myopia, neurotic, lisper, tendency to stutter, diagnosed as 
borderline and recommended to an ungraded class or a speech-correction class. 

Second examination, March, 1921, age 12.66. School record: in ungraded, 
doing first grade in reading and spelling and third grade in arithmetic, after 
six years in school, best in arithmetic, sewing, and fitting together parts of wagons, 
toys, etc., greatest interest in carpentry, and mending school furniture, poorest in 
language, reading, spelling and writing, very helpful, always willing to assist, but 
has become rather rough with smaller boys since father’s death. Examination: 
malformation of nose, speech much improved, decidedly deficient in reading, diag- 
nosed as moron, and assigned to a special school. This boy only advanced five 
months in intelligence in about three years, while his IQ declined 15 points. The 
small gain in the Stanford is partly due, no doubt, to the literary character of the 
scale. The boy has little ability in language. 


It is noteworthy that when examined on the same day by the 1908, 
1911 and Stanford versions, all the subjects grade lower by the Stan- 
ford scale, except one who grades .2 year lower by the 1908, while 
two grade the same in the 1911 and Stanford scales. The average 
difference between the Stanford and 1908 scales amounts to 1.16 
years and 11.1 IQ based on the first and second tests, the differences 
ranging from .3 to 2.1 years and from 2 to 20 IQ’s (38 cases). Based 
on the second and third tests, the average difference amounts to 1.31 
years and 12.7 IQ, ranging from .83 to 2.4 years, and from 7 to 29 IQ 
(8 cases). The average of these differences amounts to 1.19 years, 
or 11.4 IQ (46 cases). 
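
The combined figures just given are averages weighted by the number of cases in each comparison, as a brief check shows. The Python sketch below is offered only as a verification of the arithmetic; the function name is ours, and the inputs are the averages and case counts stated in the paragraph above.

```python
def pooled_average(groups):
    """Case-weighted mean of group averages; groups = [(n_cases, average), ...]."""
    total_cases = sum(n for n, _ in groups)
    return sum(n * average for n, average in groups) / total_cases


# Stanford vs. 1908 scale: 38 cases (1st and 2d tests), 8 cases (2d and 3d tests).
years = pooled_average([(38, 1.16), (8, 1.31)])
iq_points = pooled_average([(38, 11.1), (8, 12.7)])

print(round(years, 2), round(iq_points, 1))  # 1.19 11.4
```

Because 38 of the 46 cases come from the first comparison, the pooled values lie much closer to 1.16 years and 11.1 IQ than to the figures for the smaller group.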

The average differences between the Stanford and the 1911 scales 
for the subjects who were put through the two scales on the same 
day amount to .66 year (ranging from 0 to 1.5 years) and 7.3 IQ 
(ranging from 0 to 17—38 cases), based on the first and second tests; 
and to .90 year (ranging from .5 to 1.5) and 7.7 IQ (ranging from 1 
to 13), based on the second and third tests (8 cases). The average 
of these differences is .71 year or 7.4 IQ (46 cases). The differences 
are greater between the second and third tests than between the 
first and second, possibly due to the increasing age of the subjects 
and the absence of satisfactory tests in the higher ages in the 1908 
and 1911 scales. The average chronological age of all the subjects at 
the first examination was 8.96 years, at the second 11.26 years, and at 
the third 11.69.¹ There was a marked increase in the average chrono- 
logical age between the first and later tests. 

It is apparent that the average difference in the age rating and 
IQ between the 1908 or 1911 and the Stanford is too large to be 
ignored,² while the differences in individual instances are occasionally 
surprisingly large, so large that quite contradictory conclusions 
would be reached according to the particular scale employed. We 
agree with the generally accepted conclusion that the 1911 scale is 
more accurate than the 1908 (Vineland), except in the highest ages, 
but the facts are not available whereby it can be conclusively affirmed 
that the Stanford norms are more accurate than the 1911,³ except in 
the upper ages, although the scale itself is much superior in various 
particulars. The necessity of validating the accuracy of the Stanford 
norms—and revising the tests and administrative procedure—on the 
basis of the testing of a large number of unselected children from vari- 
ous sections of the country is urgent. This should be done, in our 
judgment, by a disinterested committee of psychologists, several of 
whom should have had extensive experience in actual clinical work; 
the revised scale should be made available in an inexpensive edition 
at cost,⁴ while the publication of reprints should be freely permitted 
without the fear of infringement of copyright. Our tentative con- 
clusion, based on various considerations, is that most of the Stanford 
age norms are too difficult, thus exaggerating the subject’s deficiency. 
Porteus has reached a similar conclusion so far as concerns the tests 
above age VIII.⁵ 


1 The average age at the time of the second examination of the 16 examined 
three times was 9.69. 

2 The corresponding differences between the 1908 and 1911 scales were 4.5 IQ in 
the second examination (ranging from 0 to 12), and 5.3 IQ in the third examination 
(ranging from 0 to 19). The 1908 IQ’s were lower than the 1911 in only 5 cases 
(during the second examination). 

3 An incomplete analysis of the comparative accuracy of the two scales appears 
in Preliminary Impressions of the Stanford Revision of the Binet-Simon Scale, 
Psychological Clinic, 1918, 1 f. 

4 Consider the vastly heightened cost of the Stanford revision and materials 
compared with the inexpensive Vineland guide, record forms, and materials. 

5 Porteus, S. C. Condensed Guide to the Binet Tests, Training School 
Bulletin, 1920, 1 f. 








MENTAL GROWTH AND THE IQ 


LEWIS M. TERMAN 
Stanford University 


OTHER CONTRIBUTIONS ON THE VALIDITY OF THE IQ 


Wallin¹ presents a criticism of the IQ based on data from Stanford- 
Binet tests of 411 backward and feeble-minded children in the public 
schools of St. Louis. His main criticism relates to the IQ distribution 
found in his various classificatory groups. The main groups were 
designated as “normal,” “retarded,” “backward,” “borderline or 
potentially feeble-minded,” “morons,” “imbeciles,” and “idiots.” 
The author states that his classification of the subjects into these 
categories was based chiefly upon pedagogical and mental status, the 
latter determined by use of the Stanford Revision. Medical, social, 
and family data were also used. Nothing is stated with regard to 
how pedagogical status, intelligence test, medical and social data 
were weighted and combined. Presumably the final judgment was 
largely subjective, based upon empirical, offhand evaluation of the 
various kinds of data available. 

The author informs us, however, that in no case was the IQ com- 
puted until after the diagnosis had been made.² After the classifica- 
tion was complete the IQ distribution in each group was examined. 
The extreme range of IQ for the different groups was as follows: 
“normal,” 95 to 108; “retarded,” 80 to 97; “backward,” 59 to 94; 
“borderline and potentially feeble-minded,” 56 to 84; “morons,” 
48 to 70; “imbeciles,” 21 to 65. That is, the range is wide and a large 
amount of overlapping is found. Hence, the author concludes, the IQ 
is of no value for purposes of classification. 

The argument overlooks two very important considerations. 

In the first place, it is possible that about as much overlapping 
would obtain between Wallin’s classification and that of another 
equally competent clinician using the same methods. The truth is 


1 J. E. Wallace Wallin: The Value of the Intelligence Quotient for Individual 
Diagnosis. J. of Delinquency, Vol. 4, 1919. Pp. 109-124. 

See pp. 146-147 of The Intelligence of School Children for my data on 183 
re-tests of children above 110 IQ. 

2 One might infer from the author’s discussion that life age beyond 15 or 16 
was used as divisor in computing IQ’s, although I can not be sure this was the case. 
As the subjects ranged in age as high as 19 years such a procedure would of course 
seriously affect the results. 

that the names attached to these categories have as yet acquired 
very little exact meaning. There is little agreement either as to what 
they do mean or as to what they ought to mean. I should be sur- 
prised if the classifications by two clinicians, of the same children, were 
found to correlate more than 0.7. 

In the second place, Wallin’s classification is evidently not an 
intelligence classification at all. Just what it is, we are not informed, 
though in numerous articles he has made it clear that he considers 
various other factors as important as intelligence in the diagnosis of 
feeble-mindedness. On p. 124 of the article in question he defines 
feeble-mindedness purely in terms of social and vocational incompe- 
tency. This is a common use of the term and is of course legitimate for 
practical purposes. However, as I have elsewhere pointed out,¹ it 
is a concept of little use to science. One’s ability to get on in the 
world depends upon an indefinite number of accidental factors, in- 
cluding health, looks, inherited wealth, friends, local industries, the 
economic condition of the country, etc. Surely no one ever supposed 
that feeble-mindedness, in the sense of social incompetency, is accu- 
rately measured by the IQ. 

Finally, the author takes too literally the IQ classifications others 
have proposed. As for my own classification of children as normal 
(IQ 90-109), dull (80-89), borderline (70-79), feeble-minded (below 
70), etc., it never occurred to me that any one would construe this as 
marking off well-differentiated groups, or as intended for anything 
more than a rough tentative classification. The known probable 
error of an IQ score would itself make any such rigid classification 
quite absurd. I myself have pointed out² that even if we had a perfect 
measure of intelligence we could not expect it to furnish an absolute 
index of an individual’s educational or social success. The following 
statement by me (p. 87, The Measurement of Intelligence) is explicit 
on this point: 

“It must be emphasized, however, that this doubtful group is not 
marked off by definite IQ limits. Some children with IQ as high as 
75 or even 80 will have to be classified as feeble-minded; some as low 
as 70 IQ may be so well endowed in other mental traits that they 
may manage as adults to get along fairly well in a simple environment.” 





1 Lewis M. Terman: The Binet Scale and the Diagnosis of Feeble-mindedness. 
J. of Criminal Law and Criminology, Vol. 7, 1916. Pp. 530-543. 

2 The Intelligence of School Children. Pp. 97-110, p. 127 ff; also The Measure- 
ment of Intelligence, p. 80-81. 

Elsewhere¹ I have stated my conclusions on this and related matters 
still more fully, in particular pointing out: (1) that no intelligence 
scale gives an entirely accurate measure even of intelligence; (2) that 
in the diagnosis of feeble-mindedness, medical, neurological, and social 
data are necessary; (3) that social competency and educational possi- 
bilities both depend largely upon non-intellectual mental traits; and 
(4) that no responsible psychologist would think of using a Binet test 
score as an automatic criterion of feeble-mindedness. “If any psy- 
chologist ever hoped to find such a simple standard as 12-year intelli- 
gence (or 75 IQ, etc.) an infallible criterion of fitness to be at large, 
surely he has long since been disillusioned. The writer does not for a 
moment suppose that those who have proposed such standards ever 
meant that they should be rigidly and mechanically applied” (p. 
536). Continuing, I pointed out that the term feeble-mindedness is 
currently used in two very different senses, one referring to intellectual 
defect, the other to social or vocational incompetency. “Intellectual 
feebleness,” being a fairly definite thing and at least roughly measure- 
able, is a term usable in science; “feeble-mindedness” in the sense 
of “social inefficiency” is not. 

The limitations of the IQ have also been made the subject of a 
spirited article by Dr. Mateer.² Data on fifteen specially selected 
cases are presented to show that the IQ does not always remain con- 
stant and that it can not safely be taken as a basis for differential 
diagnosis. “The IQ or C. I. A. of the individual feeble-minded child 
is sometimes not in the least a factor differentiating him from normal 
children of his age. His IQ may decrease steadily through even the 
earlier years of childhood, it may stand still, it may even temporarily 
increase. Even an IQ of 75 or 70 or 60 need not mean feeble-minded- 
ness. It may mean dementia, either in the sense of insanity or of 
other deteriorating neural condition, as for instance a juvenile paresis.” 

I do not think anyone would dispute Dr. Mateer’s contention that 
the IQ does not always remain constant, especially in the case of 
psychopathic subjects. Even as regards normal subjects its con- 
stancy is never, so far as I know, referred to as anything more than 
“relative” constancy, “approximate” constancy, “a tendency to” 
constancy, etc. Everyone makes liberal reservations regarding its 





1 J. of Criminal Law and Criminology, 1916. Pp. 530-543. 
2 Florence Mateer: The Diagnostic Fallibility of Intelligence Ratios. Ped. 
Sem., 1918. Pp. 369-392. 

constancy as far as epileptic, insane, or other types of psychopathic 
subjects are concerned. 

Dr. Mateer gives no data for normal children, and her conclusion, 
“it is self-evident that IQ’s do not remain constant” (p. 385) can not 
be taken as demonstrated to be the rule even for feeble-minded children. 
As a matter of fact, the fifteen cases which she reports are nearly all 
admitted to be psychopathic. For example, the clinical descriptions 
of the cases are full of such phrases as the following: 

Case 1. “Psychotic.” “May develop an actual insanity.” 

Case 2. Has “always been peculiar.” “Convulsions,” “head- 
aches and pains.” “Possibly epileptic.” 

Case 3. (Tests normal.) “Neuropathic family.” “Steals,” “lies,” is 
“obscene.” “He is, undoubtedly, of a neuropathic predisposition.” 

Case 4. History of “killing animals.” “Likes blood.” “Already 
gives evidence of neural disturbance.” (Later test showed degenera- 
tive changes.) 

Case 5. Handicapped by “neuropathic instability.” Tests nearly 
normal and is bright in school, but “masturbates,” “lies,” and is 
“listless,” “unreliable,” etc. (The author suggests that the abnormal 
mental disposition may be connected with an enuresis which was 
present, possibly due to failure to form right conditioned reflexes in 
childhood.) 

The other ten cases, five of whom tested below 76 IQ and seven 
below 80 IQ, are said to be “undoubtedly feeble-minded,” though one 
is reported as “bright in school.” “Four are cruel to children, two 
are cruel to animals to the extent of putting cats on hot stoves. 
At least three of them are pyromaniacs, eight have temper spells, 
seven of them are destructive. In several cases the parentage is 
unknown, but the rest are weighed down by a history of alcoholism, a 
little insanity, general inferiority but not feeble-mindedness, and three of 
them are illegitimate.” (Italics mine.) 

Dr. Mateer seems to hold it against the intelligence tests that all 
these children once tested at age. The initial tests, however, were 
made by the Goddard Revision. I calculate that by the Stanford 
Revision none of the initial IQ’s would have been above 90. In fact, 
the child who gave the highest initial IQ found by the Goddard Re- 
vision (103) was also given the Stanford Revision, with a resulting IQ 
of 90. If we make reasonable allowance for the scale used there are 
only five of her fifteen cases which show more than about 8 points of 
change in IQ over a period of one to five years. Probably anyone with 








Mental Growth and the IQ 405 


considerable clinical experience could easily duplicate this handful of 
exceptional cases described by Dr. Mateer. It may be done any 
number of times without destroying the usefulness of the IQ for such 
purposes as it is really intended to serve. 

Of course it is unfortunate that the IQ does not enable us to diag- 
nose psychopathy, epilepsy, enuresis, etc., or tell us whether the subject 
has or has not formed the appropriate conditioned reflexes. For 
such purposes, one is bound to admit, the IQ is distinctly fallible. It 
might be well to warn astronomers that there may be similar limita- 
tions to its usefulness in the prediction of eclipses! 

In a more recent article on the “Interpretation and Application 
of the IQ”¹ Professor Freeman has approached the problem from a 
different angle and has raised some important questions. He rightly 
observes that the validity of the IQ hinges upon the greater overlap- 
ping of mental ages in the upper years than in the lower; that it would 
require, for example, the standard deviation of mental ages of un- 
selected 10-year olds to be about twice that for 5-year olds, and the 
standard deviation for 15-year olds to be about three times that for 
5-year olds. He notes that those who have used Binet tests with un- 
selected children have usually found such increase in mental age over- 
lapping. When he examined the results of group intelligence tests, 
however, no such rule was found to obtain. Data from several group 
tests are presented, and in every case the variability, expressed in 
terms of point score, shows a tendency to remain constant between the 
ages of 7 or 8 and 12 or 13. 
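
The requirement which Professor Freeman describes can be made explicit. If mental age equals IQ times chronological age divided by 100, and if the spread of IQ among unselected children is the same at every age, then the standard deviation of mental age must increase in direct proportion to age. In the Python sketch below the IQ standard deviation of 12 points is a hypothetical value chosen only for illustration, and the function name is ours.

```python
def mental_age_sd(chronological_age, iq_sd):
    """SD of mental age (in years) implied by a constant IQ SD at a given age."""
    return chronological_age * iq_sd / 100.0


IQ_SD = 12.0  # hypothetical spread of IQ, assumed equal at all ages

base = mental_age_sd(5, IQ_SD)
for age in (5, 10, 15):
    sd = mental_age_sd(age, IQ_SD)
    # Columns: age, SD of mental age in years, ratio to the SD at age 5.
    print(age, round(sd, 2), round(sd / base, 2))
```

Whatever value is assumed for the IQ spread, the ratios for ages 5, 10, and 15 come out as 1, 2, and 3, which is the condition stated in the text.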

I think there is no question about the correctness of Professor 
Freeman’s observation. I found the same thing three years ago for 
army test Alpha, and have since found it to hold also for the Otis, 
National, and Terman group tests. For example, in the case of 
unselected children of 8 to 14 years (about 175 at each age), the 
standard deviation of total score of Scale A and Scale B of the National 
Test remained almost constant at about 50 points through this entire 
age range. 

We thus have an apparent contradiction, and unless it can be 
shown that one or the other of these findings is not in accord with the 
facts, some explanation must be sought which will harmonize them. 
I believe that further investigation will confirm the essential correct- 
ness of both findings. As far as the Binet results are concerned, the 





1 J. of Educational Psychology, Vol. 12, 1921. Pp. 3-13. 
progressive increase in mental age overlapping is shown in the data of
various investigators. For example, in my tests of 905 unselected
children the interquartile mental age range of 6-year olds was 10
months, and of 12-year olds 20 months. Similarly, Bobertag¹ found
11-year olds to overlap 12-year olds on Binet tests almost twice as
much as 6-year olds overlap 7-year olds. Striking confirmation of
these results is found in Burt’s report on ‘The Distribution and Rela-
tions of Educational Abilities’ in the case of a representative group of
31,965 London school children.² In this study it was found that
“in educational ability normal children tend to vary [using the
standard deviation as the unit] above and below the average level for
their age as follows: at the age of 10, by at least 1 year; at the age of 5,
by just 0.5 of a year; at the age of 15, in all probability, by nearly 1.5 
years, and throughout, by about one-tenth of their age.” (p. 31.) 
The standard deviations for the ages at which the subjects were 
considered representative were as follows: 


Age (years)...........  4.0   5.0   6.0   7.0   8.0   9.0  10.0  11.0  12.0  13.0
Stand. Dev. (years)... 0.34  0.55  0.62  0.63  0.75  0.91  1.10  1.17  1.18  1.24
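
Burt's rule of "about one-tenth of their age" can be checked directly against the tabulated values. The short sketch below divides each standard deviation by its age; the variable names are my own, and the figures are those printed in the table.

```python
# Ages and standard deviations (both in years) as printed in Burt's table.
ages = [4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0]
sds = [0.34, 0.55, 0.62, 0.63, 0.75, 0.91, 1.10, 1.17, 1.18, 1.24]

# Ratio of standard deviation to age at each age level.
ratios = [sd / age for age, sd in zip(ages, sds)]

for age, ratio in zip(ages, ratios):
    print(f"age {age:4.1f}: SD / age = {ratio:.3f}")

mean_ratio = sum(ratios) / len(ratios)
print(f"mean ratio = {mean_ratio:.3f}")
```

Every ratio falls between 0.08 and 0.12, which is what a standard deviation growing in proportion to age would require.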


The apparent contradiction might be explained as due to the shift- 
ing of the point score values of group tests in the different ranges of 
the scale. Perhaps no one would claim that a point in score has equal 
value over the entire scale range. On the other hand, one would 
hardly expect the shift in score values to be large enough to account for 
the phenomenon in question. Another explanation is suggested by a 
consideration of the psychological differences between the Binet scale 
and current group tests. The latter endeavor to measure the same 
functions throughout the range over which they are applied. Binet
tests, on the other hand, to a great extent measure different functions
at different levels. The Binet scale is constructed on the theory that mental growth
does not imply equal development of all the particular capacities at 
once, or in the same particular capacities at all periods; that certain 
differences in mental functions appear in a more or less definite order. 
It is adapted to bring out the fact that the 14-year old, for example, 





1 Otto Bobertag: Die Intelligenzprüfungsmethode von Binet-Simon bei
schwachsinnigen Kindern. Zeitschrift f. ange. Psychol., 1912, 6. Pp. 495-538.

2 Report by the Education Office submitting Three Preliminary Memoranda
by Mr. Cyril Burt, M.A., Psychologist, on the Distribution and Relations of
Educational Abilities. London, P. S. King and Son, 1916. Pp. 93.

excels the 7-year old not merely in the maturity of certain mental
functions, but that he is mentally able to do various kinds of things 
which 7-year olds can not do at all. The group tests, being so much 
more restricted, probably fail to bring out these differences to so great 
an extent as does the Binet, and as a result give a variability in the 
upper ranges which is less than it ought to be. In this respect they 
do not afford an entirely satisfactory basis for the psychological 
analysis of mental growth changes. 

My suggestions, however, are only tentative, and I shall hope to 
return to the question at another time. Professor Freeman has 
raised an issue of real importance. 











CRITERIA TO EMPLOY IN CHOICE OF TESTS 


RAYMOND FRANZEN 
Director of Research of the Public Schools of Des Moines, Iowa 
AND 
F. B. KNIGHT 
Asst. Prof. Educational Psychology, University of Iowa 


Instruments employed in the exact determination of the quantities 
of abilities and capacities involved in school procedure, have multiplied 
in the last years until the predicament of an administrator is no longer 
to find a test but to choose which of existent tests he will use. When 
we are in need of indices of reading ability, we must decide the com- 
parative value of the many tests which purport to measure reading in 
order that we may express our diagnoses in the best medium available. 
Never a day passes that some officer of public instruction does not 
decide to use some one test for some one purpose. What criteria 
have influenced his choice? 

Geographical preferences are not economical; the mails will allow 
transmission of tests from one area to another. And still one reading 
test is predominant in the west and another in the east. Advertise- 
ment should bear no weight in the dissemination of tests; we ought to 
be sufficiently familiar with all the available material to decide the 
value of tests independent of their commercial publicity. Neverthe- 
less some group intelligence tests are being used in preference to others 
less advertised where the data at hand do not substantiate the choice.
A test is not justified solely by the perspicacity and ingenuity of its 
maker,—the original data and the technique of construction are in 
most cases available,—and yet the prestige and influence of the author 
are often the sole bases upon which decisions of the comparative 
values of tests are assigned. This triumvirate of criteria—geography, 
advertisement and prestige of author—we should discharge from our 
educational judiciary.

Another triumvirate—administrative exigencies—needs to be 
given a less emphatic voice than it now exercises. The price of a 
test, the time it takes to give it and the convenience of scoring facilities 
are important factors in the choice; but only if all the tests considered 
perform the service which is needed. They are secondary criteria.
As soon as we know how much service we can expect from each of a 
group of tests, and not before, can we decide whether or not we can 
afford the time and money necessary to their administration. If a 
test does not really measure reading, and if it is reading evaluation
which is the objective, then it would pay an administrator to use the 
test that did this service even though it cost more, took longer to give 
and was scored at the expense of a great deal of time and energy. 
Often he would be better off if he considered first the other criteria 
pertaining to the psychological and statistical values of the test, 
because he would then give no test at all; that is, often the only test he 
can afford to give is the only test which does not perform the service 
he is seeking. Let him decide primary values first; then let him con- 
sider price, time and convenience in ratio to the benefits derived. 

Discarding these inexact, irrelevant and secondary considerations, 
what criteria shall be employed? A perfect test may be used in many 
ways and therefore has virtues which are for some of its uses super- 
fluous. We will enumerate all of these virtues and then outline the 
relative value of each for the most important uses of measurement in 
school life. A test may to a varying degree: 

1. Measure what it purports to measure. 

2. Yield the same diagnoses tomorrow that it does today,—the 
reliability of a test. 

3. Yield the same diagnoses in the hands of one examiner that it 
does in the hands of another,—the objectivity of a test. 

4a. Yield numerical diagnoses, the units of which are equal, so 
that equal numerical increments are symbols of equal increments of 
the ability measured,—the scaling of a test. 

4b. Mean nothing at all of the quality measured by the zero of its 
scale, so that a score of eight is twice a score of four etc. 

5. Provide standards by which comparisons may be made to 
large numbers of any one grade and of any one age,—the norms of a 
test. 

6. Interest the child. 

7. Register a wide range of abilities. 

8. Distinguish between failures, so that we can tell why a child 
has a low score as well as that he has a low score. 

9. Correlate to unity with intelligence when the abilities measured 
are at their maximum. (It is obvious that for tests of some abilities 
this is not desirable, for instance, mechanical ability.) 
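
Criterion 4b can be illustrated with a small sketch. A statement such as "a score of eight is twice a score of four" holds only when the zero of the scale means none of the ability at all. The offset used below is hypothetical; it stands for whatever amount of ability a raw score of zero leaves unrecorded.

```python
def ability_ratio(score_a, score_b, ability_at_zero_score=0.0):
    """Ratio of the abilities behind two scores.

    ability_at_zero_score is the (hypothetical) amount of ability that a
    raw score of zero still represents; it is 0.0 for a true zero point.
    """
    return (score_a + ability_at_zero_score) / (score_b + ability_at_zero_score)


# True zero: a score of 8 represents twice the ability of a score of 4.
print(ability_ratio(8, 4))                           # 2.0

# Arbitrary zero (hypothetical offset of 4 units): the ratio is no longer 2,
# although the difference between the two scores is unchanged.
print(ability_ratio(8, 4, ability_at_zero_score=4))  # 1.5
```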

Note.—No credit is here given or taken for originality in the 
formulation of criteria. It is hoped that this is a convenient assembly 
of important test virtues. 

The needs which prompt an administrator or director of research 
to the use of tests can be classified under five main heads: (1) Compari-
son,—with other cities, schools within the district, or individuals;
comparison of either the central tendencies or the spread; (2) Experi-
mentation,—to find the value of curricula, methods, text books or
time allotment; (3) Classification,—by intelligence and by informa-
tion; (4) Diagnosis and prognosis,—including the comparison of
degrees of attainment with measurements of potentiality; (5) Defi-
nitive outline of goals,—qualitative definition in terms of tests, quanti-
tative definitions in terms of locations on the scales of those tests.

A test must measure what it purports to measure to be employed 
profitably in any of these ways. Our first criterion applies then to 
all the uses. The test must either do this on the face of it or have a 
known correlation with a known criterion to vouch for its authenticity. 
The reliability and objectivity of a test are important considerations 
too in each of the uses of tests. The reliability of an average and of a 
measure of spread are functions of the reliability of the test. Com-
parison of the average or variability of a group in two abilities is not
permissible unless we know the reliability of the tests; the use for
experimentation of a test with a low reliability leads to faulty conclu-
sions; and classification, diagnoses and definition of goals made on a
basis of unreliable material are tantamount to prescriptions of a doctor
who has made a diagnosis over a telephone. Remedial treatment
implies reliable diagnoses. 

Whereas it is always of great value to be able to compare scores, 
knowing that a unit on any portion of the scale is equal to a unit on 
any other portion, to be able to say that a score of 87 is just as far 
above a score of 82 as a score of 30 is above 25, it is a sine qua non in 
the use of a test for most experimental purposes, since progress along 
a scale is generally involved. If we cannot compare progress, and 
we cannot unless a test is scaled, then we cannot gain much from com- 
parison, experiment, classification, diagnosis or quantitative definition. 
Scaled tests are always better than tests whose units are undefined;
in many situations no test at all does less harm than the use of an
unscaled test, scores on which are interpreted as though they were
scaled. Teachers and administrators generally do interpret scores so.

Standards are important in order that we may compare. We are
sufficiently awake to this need. It needs emphasis that we could use
standards of variability as well as standards of central tendency. We
should be able to compare the spread of the abilities of our 5th grade
with normal spread as well as the average equipment
in their possession with normal equipment. We know no test on the
market today with published norms of spread. It would be very
valuable for us to know standard deviations of the tests we use. The
average reading of 5th grades in Des Moines may equal average 5th 
grade reading in New York, and yet our standard deviation may be 
twice as large. As this would indicate a wide disparity of attainment 
among our 5th grade children, we would want more to know of it, 
coupled with a comparison of our S.D. in intelligence to normal S.D.,
than we would want to know the averages here and elsewhere. For 
convenient diagnosis and classification, age norms are necessary. 
Then we can translate scores in any measured ability into indices of 
maturity and make use of age as a common denominator to gain inter- 
comparison of all abilities and capacities. For experimental purposes 
standards are, of course, often unnecessary. 
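
The point about norms of spread can be made concrete with two invented sets of scores. The numbers below are hypothetical and chosen only so that the averages agree while one standard deviation is twice the other, as in the Des Moines and New York illustration.

```python
from statistics import mean, pstdev

# Hypothetical reading scores for two fifth-grade groups; not real data.
group_a = [40, 45, 50, 55, 60]
group_b = [30, 40, 50, 60, 70]

mean_a, mean_b = mean(group_a), mean(group_b)
sd_a, sd_b = pstdev(group_a), pstdev(group_b)

print(f"group A: mean = {mean_a}, S.D. = {sd_a:.2f}")
print(f"group B: mean = {mean_b}, S.D. = {sd_b:.2f}")
print(f"ratio of S.D.s (B to A) = {sd_b / sd_a:.1f}")
```

A norm that reports only the average would treat the two groups as identical; the difference appears only when the standard deviations are compared.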

That a test is better for purposes of comparison, classification, 
diagnosis, prognosis and quantitative definition if it interests the 
children and if it is applicable to widely separated extremes in degree 
of possession of the quality under investigation, is readily com- 
prehensible. The more the children are interested the easier it is to 
obtain optimum results. The wider the range which the test measures 
the more you can compare, the further you can predict and the more 
inclusive is the definitive outline. It is obvious that if we can use the 
same test from 3rd grade to senior high school, that will be better than 
using one from 3rd through 5th, another from 6th through 8th and still 
another from 9th through 12th. For experimental purposes these 
two things may not be desirable, and in some cases they may be 
undesirable;—it is conceivable that a test may be chosen because of
its lack of interest to suit certain experimental conditions. 

A test is always better than another, other things being equal, if it 
distinguishes between failures. It is good to be able to say which 
children have a low “Arithmetic ability.” It is better to say that a
child has a low adding ability in the fundamentals. It is best to say 
a child has a low adding ability in the fundamentals because his com- 
binations of 9 and 5 are weak. If we can tell why a person is weak in 
terms of the elements that contribute to strength, then remedial work 
is readily encompassed. Much work on our tests needs still to be 
done before we reach an ideal in terms of this criterion.

The last of the listed criteria also applies to all but the experimental 
use. Its value is that it affords the possibility of comparison of 
achievement in a function to the intelligence involved. It provides a 
direct check on the correspondence of the two axes of classification— 
capacity and information, a diagnosis in terms of inherited capacity 
and a definition of goals in terms of the intelligence of the children for 
whom the goals are instituted. In a word, if we know the most that
can be, we can better deal with the amounts that we now have. 
Nature has made lavish investments in the nervous systems of a few. 
These investments should be made to pay social dividends. We can 
only accomplish this through ratios of achievement to intelligence. 
These ratios demand high correlations of product tests with intelli-
gence tests when the abilities are pushed to their limits.¹ In case
the test being judged is an intelligence test the criterion still holds. 

In closing it may be appropriate to emphasize the fact that the first 
criterion is not a pedantic insistence on the obvious. There are many 
tests on the market which seem to test an ability and which in reality 
do not. An “arithmetic” test, for instance, which lays undue stress 
on time might test speed in writing figures; a “reading” test which
had in it too many artificial difficulties not ordinarily encountered in
reading might test attention. These facts have been well pointed out
by Thorndike and Courtis² and again by Pressey and Pressey.³ A
choice of a test to perform a function should be preceded by careful 
study of all criteria and where published data, correlation with a 
criterion, reliability coefficients, correlations with other tests, distri- 
butions of unselected data and scale determinations are not available 
one should think twice before using the test. 

We should insist upon the use of tests which can be proven to test 
what they purport to measure, which are reliable and objective, which 
are scaled and which have well defined norms based on sufficient 
material. If in addition we are able to select tests which interest the 
children, are applicable to all grades, distinguish between failures and 
correlate highly with intelligence at their maximum, we will be able to 
manipulate our results in such manner as to gain additional benefits. 
Certainly a consideration of these test virtues will avoid much useless 
and even harmful work in survey of abilities and may contribute 
toward a selection of the better instruments of measurement by 
diminishing the sales of tests which readily confess their inadequacy 
to a judgment in terms of these criteria. 


1 Refer to Raymond Franzen: An Accomplishment Quotient, T. C. Record,
Nov., 1920.
Rudolf Pintner and Helen Marshall: A Combined Mental-educational Survey,
Journal of Educational Psychology, Jan., 1921.

2 Thorndike, E. L. and Courtis, S. A.: Correction Formulae for Addition Tests,
T. C. Record, 1920, 21, 1-24.

3 Pressey, L. W. and S. L.: “A Critical Study of the Concept of Silent Reading
Ability,” Journal of Educational Psychology, Jan., 1921.








PERSONAL JUDGMENTS 


E. E. LINDSAY 
The State University of Iowa 


The purpose of the study herein reported was to compare teachers’ 
estimates of children’s native capacities with these capacities as
determined by as scientific a method as possible. The scientific 
method used was the Binet-Simon tests; the teachers were a group of 
graduate students together with two University professors; the children 
were a tenth grade history class. The class was made up of twelve 
girls and seven boys, coming from widely varying home conditions. 
The graduate students were all men of special training in education and,
with one exception, had had years of teaching experience. The Binet-
Simon test is assumed to be a fair measure of the individual capacity.

After a class-room acquaintanceship of at least one month, the 
seven members of the teaching group were asked to rank the class in the 
order of their native mental capacity. In this judgment they were 
asked to eliminate all such factors as personality, effort, attainment, 
etc. The results of these independent judgments together with the ex- 
amination grades and the IQ’s are ranked and presented in Table I: 




















TABLE I

                                              Ranks
Pupil   IQ  Exami-    IQ  Exami- Regular    Pro-    v   w   x   y   z  vwxyz
            nation        nation teacher  fessor
             score

1g     122    85.0     1     4.0       5       1    2   2   3   2   2    2.5
2g     109    83.0     2     5.0       4      11   17  15   7  12  13   12.0
3g     102    80.0     3     6.0      10       9    3   3   4   7   5    4.0
4g     100    92.0     4     2.0       2       2    1   1   2   3   3    1.0
5g      99    77.0     5     7.0      13       6   10   5   8   5   4    5.0
6b      95    65.0     6     8.5       3       7    7   9  16   4   8    8.0
7b      95    97.0     7     1.0       1       4    4   4   1   1   1    2.5
8g      93    90.0     8     3.0       6       5    6  10   6   6   6    7.0
9b      93    65.0     9     8.5       8      17   13  18  12  15  19   18.0
10g     93    50.0    10    14.5      12      18   11   8   9   9   9    9.0
11b     90    46.0    11    16.0      18      12   18  13  13  17  15   16.0
12g     88    63.0    12    11.5       7       3    5   6   5  10   7    6.0
13b     88    62.0    13    13.0      15       8   12   7  10  11  10   10.0
14g     87    39.0    14    17.0      17      14    8  16  11  16  11   11.0
15b     81    64.0    15    10.0       9      10    9  11  17  19  12   13.0
16b     77    50.0    16    14.5      11      15   16  12  15  13  17   14.0
17g     75    35.7    17    18.0      19      16   19  17  14   8  18   16.0
18g     68    29.0    18    19.0      14      13   15  14  19  14  14   16.0
19g     66    63.0    19    11.5      16      19   14  19  18  18  16   19.0
This table is read—1g’s IQ is 122; her examination score is 85
per cent. Of the 19 children she ranked: in IQ—1st; in examination
score—4th; in teacher’s judgment—5th, . . . in composite judgment 
of the five graduate men—2.5. The ranks and scores underlined fall 
within the range of normal on the IQ basis, namely 90 to 110. Sex of 
pupils is indicated by letters g and b. 

Table I reveals wide discrepancies between native capacity as 
measured by the Binet-Simon test and by personal judgment. 2g’s 
case perhaps is the most extreme. Her IQ ranks second in the group 
and yet she is placed as far down as 17th, or sub-normal, by one of the 
men, while by no one except her regular teacher is she placed higher 
than seventh. The group as a whole placed her more than half the 
total range below her IQ position. 9b is another extreme example. 
In opposition to these differences there are cases like 5g’s where the 
group and the IQ rankings are very similar. 

As a better method of comparison the correlations between each 
of these rankings with all of the others were compiled using the 
Spearman R method.¹ This table follows.

This table reads, IQ’s correlate with examination grades 0.53; with 
judgments of individuals as follows: regular teacher—0.38, professor— 
0.43 . . . , composite judgment—0.52. 














TABLE II

               Exami-  Regular   Pro-
               nation  teacher  fessor     v     w     x     y     z   vwxyz

IQ              0.53     0.38    0.43    0.40  0.47  0.52  0.45  0.47   0.52
Examination     ....     0.61    0.43    0.46  0.36  0.48  0.42  0.47   0.46
Reg. Teacher    ....     ....    0.38    0.48  0.33  0.35  0.43  0.42   0.38
Professor       ....     ....    ....    0.50  0.63  0.47  0.48  0.60   0.64
v               ....     ....    ....    ....  0.55  0.48  0.42  0.63   0.62
w               ....     ....    ....    ....  ....  0.48  0.53  0.68   0.76
x               ....     ....    ....    ....  ....  ....  0.52  0.58   0.66
y               ....     ....    ....    ....  ....  ....  ....  0.62   0.68
z               ....     ....    ....    ....  ....  ....  ....  ....   0.82
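
The first-row coefficients can be recomputed from the ranks in Table I. The sketch below assumes that the "Spearman R method" is Spearman's footrule, R = 1 - 6ΣG/(n² - 1), where G stands for the positive rank differences (the "gains"). That reading is an assumption, although it reproduces the printed 0.53 and 0.38. The function and variable names are my own.

```python
def spearman_footrule(ranks_a, ranks_b):
    """Spearman's footrule R = 1 - 6 * sum(G) / (n**2 - 1).

    G runs over the positive differences ranks_b[i] - ranks_a[i] only.
    """
    if len(ranks_a) != len(ranks_b):
        raise ValueError("rank lists must have the same length")
    n = len(ranks_a)
    gains = sum(b - a for a, b in zip(ranks_a, ranks_b) if b > a)
    return 1.0 - 6.0 * gains / (n * n - 1)


# Ranks of the 19 pupils as read from Table I, in pupil order 1g ... 19g.
iq_rank = list(range(1, 20))
exam_rank = [4, 5, 6, 2, 7, 8.5, 1, 3, 8.5, 14.5,
             16, 11.5, 13, 17, 10, 14.5, 18, 19, 11.5]
teacher_rank = [5, 4, 10, 2, 13, 3, 1, 6, 8, 12,
                18, 7, 15, 17, 9, 11, 19, 14, 16]

print(f"IQ vs examination:     {spearman_footrule(iq_rank, exam_rank):.2f}")     # 0.53
print(f"IQ vs regular teacher: {spearman_footrule(iq_rank, teacher_rank):.2f}")  # 0.38
```

The footrule uses only the sum of the gains, so it is quicker to compute by hand than the rank-difference coefficient based on squared differences, and it generally gives a lower value for the same data.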



































With one or two unimportant exceptions the correlations displayed 
by this table are none of them high enough to be significant to any 


1 This work was done by a class in statistics under the direction of Dr. H. A. 
Greene of the College of Education, State University of Iowa. 
marked degree. They would indicate the examination grade as the
best criterion of native ability but even here the correlation is not high 
enough to warrant the assumption that scholastic standing, as deter- 
mined by examinations, is a safe guide to mental endowment. It is 
interesting to note that of the three individuals whose judgments 
correlated lowest with the IQ, two were university professors and the 
other the youngest graduate student with no teaching experience. 
The regular teacher’s estimate correlated the lowest of all. The high 
and low judgments correlated with each other 0.35. 

In drawing conclusions from this experiment, two factors must be 
considered. The group judging was a very highly selected one, both 
as to training and experience, and the number of cases involved is 
small. The findings would tend toward the following conclusions. 

1. Teachers’ estimates of children’s native capacity are significant, 
but to no marked degree. 

2. Training and experience of the teacher do not seem greatly to 
affect this significance. 

3. Individual judgments of the same children by observers with
approximately the same contact differ widely. 


4. Other factors than native ability do enter into one’s judgment 
of same. 
REPORTED BY CECILE COLLOTON
Department of Educational Psychology, Lincoln School of Teachers’ College 


EDUCATIONAL TESTS 


Cooperative Chemistry Tests. Seth Hayes. Journal of Educational Research, 1921,
Sept., 109-120. Description of the work of chemistry teachers in standardizing 
questions for the subject of chemistry as presented in Cleveland. 

The Measurement of College Work. Ben D. Wood. Educational Administra- 
tion and Supervision, 1921, Sept., 301-331. Report of an experiment conducted 
by the staff of instructors in Contemporary Civilization in Columbia College: a 
new method of examination and its results. 

Interpreting Achievement in School in Terms of Intelligence. I. N. Madsen.
School and Society, 1921, July, 59-60. A method of computing an Achievement
Quotient based on age-grade standards in intelligence and educational tests, to
show the relation of actual achievement to possible achievement.


INTELLIGENCE TESTS 


On the New Plan of Admitting Students at Columbia University. Dean H. E. 
Hawkes—Dr. A. L. Jones. Journal of Educational Research, 1921, Sept., 95-101. 
The mental test as a possible substitute for the old method of entrance exam- 
inations. 

Intelligence Classification and Mental Hygiene. Garry C. Meyers. Peda- 
gogical Seminary, 1921, June, 156-160. A scheme for a nationwide classification 
of school children on basis of intelligence rating; its practical advantages in terms 
of higher mental and social efficiency. 

A Ten-minute Intelligence Test in Junior Employment Offices. Harold H. 
Bixler. School and Society, 1921, Sept., 166-168. Comparison of ten-minute 
Test Z (used to test all applicants at the Pittsburgh Public School Employment 
Office) with the 45-minute Otis Test. Coefficient of Correlation = 0.7. 

The Reliability of the Binet Scale and of Pedagogical Scales. Arthur S. Otis and 
Herbert E. Knollin. Statistical Formulae for determining the reliability of scales.


LEARNING IN THE SCHOOL SUBJECTS 


A Year’s Study of the Daily Learning of Six Children. George E. Freeland. 
Pedagogical Seminary, 1921, June, 97-115. Factors in learning as shown by an 
extended study of six normal children, grades 1 to 6, learning to typewrite under 
normal school conditions. 
A New Book for Teachers of History.—This book, says the author
in his preface, has been written in the interest¹ of better history teach-
ing. While the technic of teaching has received chief emphasis, the
book contains many excellent suggestions and concrete illustrations 
covering the problem of organizing history courses for teaching 
purposes. 

From the point of view of educational method, the book discusses 
the history recitation. In this connection it treats of the textbook, 
source, topical and problem methods of presenting subject matter. 
It gives suggestions for teaching pupils to study history. It illus- 
trates the problem of written work in history as well as taking up in 
detail the question of the collateral reading of the pupil and his use 
of the high school library. It gives some space to the new movement 
to standardize tests and examinations in history; though this treatment 
would have been more helpful if it had told in detail how the teacher 
was to use these tests. Illustration of their value in diagnosing in- 
dividual and class difficulties might well have been included. The 
book also takes up the question of teaching current events in connec- 
tion with history. 

Chapter eleven of the book, which discusses the planning of the course
and the organization of daily lessons, should prove to be particu-
larly helpful to the beginning teacher. Here the author outlines a
scheme of general organization into distinct periods of history. With-
in each period he illustrates types of the daily work, i.e., he gives
examples of outlines, of maps to make, collateral reading, dates and 
events to know and to remember, and historical personages to know 
and to identify. These important details covering the organization 
of the course might have been treated more at length for it is here 
that the inexperienced teacher needs much assistance. 

It is to be regretted that the author did not discuss more in detail 
psychological and pedagogical phases of the subject. For example 

1 Tryon, R. M. The Teaching of History in Junior and Senior High Schools.
Ginn and Co., 1921. 284 pp. 
the psychological processes involved in the learning of history, the
place of association, the amount of repetition necessary to fix connec-
tions and the use of imagination and judgment are important questions
of educational method with which any history teacher should be 
familiar. Such mooted questions as the proper treatment of impor- 
tant institutions, description of the life of the times and the relating 
of history to important contemporary activities and problems should 
also be treated in some detail in a book on the teaching of history. 

The reviewer agrees with the statement in the preface that while 
there are many ways of teaching history it is fundamental to educa- 
tional method that the teacher know that there are a number of ways 
of doing the multitude of things connected with everyday procedure in 
that subject. This book tells the teacher of history about many of 
these ways and points out the procedure in managing them. 

EARLE RUGG.





The Army Mental Tests—Much has already been written of the 
work of psychologists in the United States army during the recent 
war. Their services were manifold and psychologists were utilized 
in many different branches of the army. The present volume¹ deals
with the work of the group of psychologists who were engaged in giving 
mental tests. This was done under the direction of the Surgeon 
General’s Office. The work started shortly after the United States 
declared war and continued long after the armistice. Hundreds of 
psychologists were engaged in the service and nearly two million men 
were examined. The reader will, therefore, readily appreciate the 
difficulty of telling the story of this vast undertaking and of presenting 
the results of such a large accumulation of tests. The volume is 
under the general editorship of Yerkes. It is divided into three parts,
each under a separate editor, namely Yerkes, Terman and Boring
respectively, and each one of these expresses obligations to many
helpers. 

Part I deals with the history and organization of the service from 
the first unofficial tests tried out under the auspices of the American 
Psychological Association, through the official trial of the plan, the 
organization of the service as a part of the military establishment, 
down to the abandonment of the work in 1919. The report brings

1 Psychological Examining in the United States Army. Edited by R. M.
Yerkes. Memoirs of the National Academy of Sciences. Vol. XV, 1921. Pp.
890.

out well the various difficulties that the psychologists had to meet
and it makes a very instructive, although at times, depressing picture. 
What seems to have retarded and hampered the work more than 
anything else was the continual confusion of psychological examining 
with psychiatric work. It emphasizes again, what was apparent at 
the time, that the psychological service was misplaced as a part of the 
medical branch of the army, that it should have been an integral 
part of personnel work, and not under the control of the Surgeon 
General’s Office. In spite of these and many other handicaps, it is 
gratifying to know how much was accomplished and how favorably 
in general the work was received. The whole report shows clearly the
perseverance and doggedness on the part of the Chief of the Division of 
Psychology in the face of much prejudice and ignorance. He and his 
co-workers are to be congratulated upon what they accomplished. 

In Parts II and III we are given a description of the development 
of the tests employed and some of the more important results. Part 
II is in the main historical and deals largely with examination a, 
from which alpha was developed. It also explains how the need for a 
test of illiterates and foreigners arose and how in response to this beta 
and the performance scale were constructed. Part III presents dis- 
tributions of scores for a fair sampling of the two million men tested. 
There was neither time nor opportunity to tabulate the complete 
data, nor is it likely that much additional information would have been 
obtained by so doing. There is in this part of the book a great number 
of distribution tables and it will prove a perfect mine for the statistician 
who may desire to work up other aspects of the data. The conclusions 
drawn are extremely conservative and they do not go beyond the 
data at hand. They arouse in the mind of the reader many interesting 
conjectures, and these results should prove a stimulus for further
research in many directions. 

Because of the very nature of a large work such as the one before 
us, it is inevitable that there should be a certain amount of repetition 
and overlapping, and this is the case particularly with reference to 
Parts I and II. There are several lines of investigation that one would
have liked to see undertaken or discussed more thoroughly, but it 
would be absurd on the part of a reviewer to point out omissions, 
because the editors are keenly conscious of such themselves and could 
unquestionably tell of more things that might have been done than 
any reviewer could. Considering the time and assistance available, 
they have certainly made the most of their opportunities.

The reader of this volume is impressed with the amount of work
accomplished in so short a time. It is a splendid example of what 
cooperative research can do with the right motive or stimulus. The 
army testing established the validity of the group test method in the 
space of a few months and silenced the doubts and misbelief that were 
current before that time. It furthermore has given us the best esti- 
mate of the average mental ability of the population at large and has 
shown us how much we had overestimated this in the past. And, 
lastly, it brought the use of mental tests very much to the front and 
demonstrated their value to the world at large. It is very valuable 
to have a complete record of the work. This volume will remain a 
worthy monument, better than one in stone or brass, recounting the 
patriotism of American psychologists and their desire to serve their 
country in times of great need. 

R. PINTNER. 





Adolescence—Ever since Stanley Hall wrote his monumental 
work on adolescence, this period of life has attracted many writers. 
The present book¹ is by the author of the well-known “Psychology
of Childhood,” which was published in 1893 and which formed an
important contribution at that time to the child-study movement. 
In the same way the author has now given us the benefit of his observa- 
tions and thoughts on adolescence. In a pleasant, readable style he 
discusses instinct, emotion, intellect, will and so forth with reference 
to the adolescent, and with reference to the difference between the 
adolescent and the child. Considerable attention is paid to the aesthe- 
tic, moral, and religious aspects of the adolescent life. The book 
abounds in broad generalizations and one could wish for more specific 
information in their support. In many places one feels the lack of 
actual data. For example, when the author tells us with reference 
to the sense of smell in the adolescent that “the threshold is lowered,
and the just observable difference in odors is very small,” one feels
the need of experimental evidence in confirmation of such a statement, 
and the same is true of many other similar statements found in the 
book. 


R. PINTNER.


1 Tracy, F.: The Psychology of Adolescence. Macmillan, 1921. Pp. 246.

A Critical Discussion of Project Methods.—Stevenson¹ has written
a very able summary and criticism of existing concepts and practices 
of project teaching together with a conservative formulation of his own.
The project is defined as “a problematic act carried to completion in 
its natural setting.” The criteria are: (1) information is to be acquired
by reasoning rather than by memory; (2) information is to be
acquired for its use in modifying conduct rather than for its own sake; 
(3) principles are to be introduced as they are needed in the solution 
of a problem rather than before the solution is begun; and (4) learning 
is to take place in a natural rather than in an artificial setting. The 
author emphasizes the last criterion: “The provision for the natural
setting of the teaching situation is the distinctive contribution of the 
project method. Without the natural setting there is no project.” 

Far from recommending a complete discard of the present curriculum, 
the author does not attempt to organize any subject completely on the 
project basis. He does believe that “at least certain units of the
elementary and high school subjects” may be taught by projects. 
He believes in the scientific determination of minimum essentials and 
in seeing to it that they are learned. The projects are used so far 
as is wise or possible. “If it is found difficult to provide projects
for these facts, or if the project method seems to be uneconomical, 
then the problem method or the method of presenting the material 
systematically should be utilized.” Always, there should be drill and
review until “a systematic grasp of the subject is realized.” The
project thus becomes a supplement to other methods rather than a
complete substitute for them. Projects “help bridge the gap between
school tasks and tasks carried on outside the school.”

The summaries and criticisms of definitions proposed by leading 
writers on project methods and the bunched pages of sample projects 
in many subjects add greatly to the usefulness of the book.


me a. Sh 





A Psychology for Laymen by an English Writer.—The author states
that The Psychology of Everyday Life² is not an elementary textbook of
psychology, nor a “popular account of some of the marvels of psychology
with all the psychology left out,” but “the main facts of the


1 Stevenson, John Alford. The Project Method of Teaching. New York: Macmillan
Co., 1921. Pp. XVI + 305.


2 Drever, James. The Psychology of Everyday Life. New York: E. P. Dutton
& Co., 1921. Pp. IX + 164.

science so far as these touch the life of the man in the street.” What
we find is a series of short essays on such topics as instincts and 
emotions, imitation, suggestion, play, sensations, perception, memory, 
hallucinations and spiritualism, which include many definitions, 
descriptions and classifications of concepts. The book represents an 
effort to combine many of the conceptions of James, McDougall and
Freud. In dealing with appetites, instincts, moods, emotions, sentiments
and social organizations, the writings of McDougall are largely followed,
while in the chapters on perception, memory and thinking,
the influence of James appears. In almost every chapter, the 
Freudian explanations are offered, and on the whole the writer is 
favorably inclined toward them. The book is interesting, but some- 
times (as in dealing with emotions, moods and sentiments) rather 
confusing. Almost no space is given to the recent work of psycholo- 
gists in mental testing or in professional and vocational fields other than 
psychoanalysis.


Sex Education for Boys.—This little book¹ gives a hundred pages of
sound and useful advice to fathers concerning the enlightenment of 
boys in matters of sex. The author does not recommend sermons or 
punishments, but gives a series of “projects” by which the right information
is provided at the opportune time, in a manner less sanctimonious
and pedagogically more sound than is often found in books
of this type.





II. BRIEF NOTICES OF NEW GENERAL EDUCATIONAL BOOKS


1. Richardson, M. W. Making a High School Program. School
Efficiency Monograph Series. World Book Co., Yonkers, 
N. Y., 1921. Pp. VIII + 27. Paper. 

A little manual telling in detail how to make a high school pro- 
gram on the block system, based on the writer’s years of experience 
in making the program of the Girls’ High School of Boston. Supt. 
F. V. Thompson reports that the plan is in operation in a number 
of Boston high schools and works well. All necessary forms, charts, 
diagrams, and illustrative programs are printed in the manual. 


1 Galloway, T. W. The Father and His Boy. New York: Association Press,
1921. Pp. XI + 199.

2. Davis, E. E. The Twentieth Century Rural School. Bobbs-Merrill,
Indianapolis, 1921. 242 pages.


A well-written handbook for rural teachers. It abounds in concrete 
episodes that show examples of good and bad educational work by 
actual rural teachers and supervisors. All phases of rural school 
work are covered: the first approach; school library; getting the
school before the people; some vitalizing educational agencies and 
organizations; school playgrounds; the social factor in rural life; 
making better citizens; salaries; school taxes in country districts; 
roads and communication; the public school and health of the country;
rural school museum; a “standard” school; larger school units
in the country. 


3. Lull, H. G. and Wilson, H. B. The Redirection of High School
Instruction. J. B. Lippincott, Philadelphia, 1921. 286 pages. 


A brief text-book for reading circles and courses in methods in 
secondary education, which includes discussions of the administra- 
tion of the curriculum, the administration of the student activities, 
and the selection and evaluation of subject matter. The book represents
the theory of a “minimal essentials” course (“the social core
of the curriculum” is constantly stressed) to be required of all
pupils. It gives concrete illustrations for each major high school 
subject, of programs, courses, and devices. A definite discussion of 
the organization of instruction on the “project-problem” basis is
given. A number of actual community and school surveys are 
reported. 


4. Foster, H. H. Principles of Teaching in Secondary Education. 
New York: Scribner’s Sons, 1921. Pp. XVIII + 367.


This is a book in teaching methods for prospective or untrained 
teachers. It discusses such psychological matters as: instincts; interests
and teaching; attention and teaching; associative learning;
transfer of efficiency. It sets forth: different current aims of in- 
struction; current practices in the conduct of the class exercise; the 
use of the question in teaching; recitations and lesson development: 
how lessons can employ problem-solving methods and secure appreciation
and expression; standards and measurements in teaching.

5. Monroe, W. S. Report of Bureau of Educational Research, Division
of Educational Tests. For 1919-1920. Urbana, Ill.:
University of Illinois, 1921. Paper. 64 pages. 25 cents.


A report of grade norms for 12 different standard tests in arith- 
metic, reading, language, history, algebra, geometry, and for general 
intelligence. Results of giving tests to pupils of Illinois schools are 
reported in the form of (1) grade medians, (2) number of pupils 
attaining indicated scores in each grade, (3) tables of percentile 
scores. A very valuable handbook compilation for users of standard 
tests in city schools. 


6. Douglass, H. R. The Derivation and Standardization of a Series
of Diagnostic Tests for the Fundamentals of First Year 
Algebra. Eugene, Oregon: University of Oregon, 1921. 
Paper. Pp. 48. 


Report of detailed investigation to determine what constitutes the 
fundamentals of first year algebra, and to devise a series of tests for 
testing ability and diagnosing weaknesses. Contains very good evaluation
of Rugg-Clark and Hotz tests. Supplies suggested new tests.